This cell reads every row of the data file into a list of seven lists, one per column, in header order (data[0] holds the tweet_id column and data[4] holds the tweet text). Empty fields are preserved in the lists as empty strings.

In [1]:
import csv

# One list per column; the file has seven columns
data = [[] for i in range(7)]

with open('twcs.csv', 'r', encoding='utf-8') as infile:
    datafile = csv.DictReader(infile)
    for row in datafile:
        # Keys come back in header order (Python 3.6+), so column i goes to data[i]
        for i, item in enumerate(row):
            data[i].append(row[item])

Q1: The code below prints the length of the "tweet_id" column. Its result is the number of tweets in the file.

In [2]:
print("There are",len(data[0]),"Tweets in this file.")

There are 2811774 Tweets in this file.


Some tweet_id numbers do not occur in the file; for example, there is no tweet 7, 9, or 10 among the first ten IDs printed below. In addition, some tweets span two lines in the file, such as tweet_id 34. This doesn't cause issues for the analysis, but it does mean there are fewer tweets than lines in the file, which is worth noting.

In [3]:
for item in data[0][0:10]:
    print(item)

1
2
3
4
5
6
8
11
12
15
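
The gap between line count and tweet count is consistent with how the csv module treats quoted fields: a line break inside quotes stays within one record. The following is a minimal sketch on made-up in-memory data; the two-column sample and the helper name load_columns are assumptions for illustration, not the real layout of twcs.csv.

```python
import csv
import io

# Hypothetical two-column sample; the second record's text contains a line break.
SAMPLE = 'tweet_id,text\n1,hello\n2,"first line\nsecond line"\n3,\n'

def load_columns(infile):
    """Read a CSV file object into a dict mapping column name -> list of values."""
    reader = csv.DictReader(infile)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(row[name])
    return columns

columns = load_columns(io.StringIO(SAMPLE, newline=''))
line_count = len(SAMPLE.splitlines()) - 1  # physical lines, minus the header
print(len(columns['tweet_id']), "records on", line_count, "data lines")
```

Keying the columns by name rather than by position is a design choice of this sketch; it avoids having to remember that the text lives at index 4.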


Q2: Some of the tweets contain non-English characters, as shown below. This is actually the slowest part of the program. In hindsight, I could have done this differently, but it gets the job done.

In [4]:
print(data[4][data[0].index('269')])

@115770 こんにちは、アマゾン公式です。Fire TV Stickが見れないというのは、どのような状況でしょうか。一般的なトラブルシューティングを記載したヘルプがございますので、ご参照ください。https://t.co/2pbG55qJ7h ET
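
The call data[0].index('269') scans the ID column from the start each time it runs. If many lookups were needed, one option would be to build a dictionary from tweet_id to row position once. This is a small sketch with stand-in lists; ids and texts are made-up substitutes for data[0] and data[4], and position_of is a hypothetical name.

```python
# Stand-ins for data[0] (tweet_id column) and data[4] (text column).
ids = ['1', '2', '3', '269']
texts = ['first', 'second', 'third', 'こんにちは']

# Build the index once: tweet_id -> position in the column lists.
# (Assumes IDs are unique; with duplicates the last position would win.)
position_of = {tweet_id: pos for pos, tweet_id in enumerate(ids)}

# Each lookup is now a dictionary access instead of a linear scan.
print(texts[position_of['269']])
```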


Q3: Here I determine the fraction of each tweet's characters that are non-ASCII, then count the tweets in which 50% or more of the characters are non-ASCII. I test each character's ordinal value with ord() because the original method I tried was too slow. The two methods gave slightly different results (in the snippet below, totalChar is only incremented for ASCII characters, which may account for that), and I understand this approach more thoroughly. My original attempt included the following code, which was far too inefficient. What follows is much faster.

        try:  # Snippet of initial attempt
            character.encode('ascii')
            totalChar+=1
        except UnicodeEncodeError:
            nonAscii+=1
        else:
            pass

In [5]:
count = 0
for leString in data[4]:
    totalChar = len(leString)
    # Characters with code points above 127 fall outside ASCII
    nonAscii = sum(1 for character in leString if ord(character) > 127)
    # Skip empty tweets to avoid dividing by zero
    if totalChar > 0 and nonAscii/totalChar >= 0.5:
        count += 1

print("There are",count,"Tweets containing 50% or more non-ASCII characters.")

There are 20245 Tweets containing 50% or more non-ASCII characters.
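
The per-tweet test can be pulled into a small function, which makes it easy to check on known strings. This is a sketch rather than the code that produced the count above; the name non_ascii_fraction is made up for illustration.

```python
def non_ascii_fraction(text):
    """Return the share of characters in text whose code point is above 127."""
    if not text:
        return 0.0  # treat an empty string as containing no non-ASCII characters
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii / len(text)

samples = ["Glad we could help!", "こんにちは", "ok こん"]
for s in samples:
    # True when at least half of the characters are non-ASCII
    print(s, non_ascii_fraction(s) >= 0.5)
```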


Q4: Here I count the number of unique Twitter names by checking for an @ symbol at the start of each word, looking the name up in a running dictionary of names, and adding it if it isn't there or incrementing its counter if it is. The while loop strips punctuation or other special characters (except underscores) off the end of each name, if applicable, and can handle any number of special characters provided they are not intertwined with alphanumeric characters.

In [6]:
nameCounts = {}
for leString in data[4]:
    splitString = leString.split()
    for word in splitString:
        if word[0] == '@' and len(word) > 1:
            # Strip trailing punctuation, keeping underscores.
            # Compare strings with !=, not "is not", which tests identity.
            while not word[-1].isalnum() and word[-1] != '_' and len(word) > 1:
                word = word[:-1]
            if word in nameCounts:
                nameCounts[word] += 1
            elif len(word) > 1:
                nameCounts[word] = 1

# Compute the total once, after the loop has finished
finalCount = len(nameCounts)

print("There are",finalCount,"unique Twitter names in this file")

There are 717692 unique Twitter names in this file
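
The same counting logic can be sketched with collections.Counter, which removes the need for the explicit "already seen" branch. The helper names clean_mention and count_mentions and the sample tweets are made up for illustration; this is not the cell that produced the number above.

```python
from collections import Counter

def clean_mention(word):
    """Return the @name with trailing punctuation removed, or None if word is not a mention."""
    if not word.startswith('@'):
        return None
    # Drop trailing characters that are neither alphanumeric nor underscores.
    while len(word) > 1 and not word[-1].isalnum() and word[-1] != '_':
        word = word[:-1]
    # A bare "@" is not a name.
    return word if len(word) > 1 else None

def count_mentions(tweets):
    """Count how often each cleaned @name appears across the tweets."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.split():
            name = clean_mention(word)
            if name is not None:
                counts[name] += 1
    return counts

sample = ["@AmazonHelp, thanks!", "hi @AmazonHelp!!! and @Delta.", "@ alone", "@some_name_"]
mentions = count_mentions(sample)
print(len(mentions), "unique names")
```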


Q5: Using exactly the same counting code as Q4, I simply sort the dictionary's items by count into an ascending list, then output the last 10 entries in reverse order, which are the most-mentioned names.

In [7]:
from operator import itemgetter
nameCounts = {}
for leString in data[4]:
    splitString = leString.split()
    for word in splitString:
        if word[0] == '@' and len(word) > 1:
            # Strip trailing punctuation, keeping underscores
            while not word[-1].isalnum() and word[-1] != '_' and len(word) > 1:
                word = word[:-1]
            if word in nameCounts:
                nameCounts[word] += 1
            elif len(word) > 1:
                nameCounts[word] = 1

# Sort (name, count) pairs by count, ascending
sortD = sorted(nameCounts.items(),key=itemgetter(1))

print("The following list is the top 10 Twitter usernames mentioned:")
# Walk backwards from the end so the highest counts print first
for i in range(10):
    print(sortD[-i-1][0])

The following list is the top 10 Twitter usernames mentioned:
@AmazonHelp
@AppleSupport
@AmericanAir
@Uber_Support
@Delta
@115858
@VirginTrains
@Tesco
@SouthwestAir
@SpotifyCares
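
For the ranking step, collections.Counter provides most_common(n), which returns the n highest counts in descending order. This is a sketch on made-up counts standing in for nameCounts; note that ties may come out in a different order than with the slice-from-the-end approach above.

```python
from collections import Counter

# Made-up counts standing in for nameCounts.
name_counts = Counter({'@AmazonHelp': 5, '@Delta': 2, '@Tesco': 3, '@115858': 1})

# The three most-mentioned names, highest count first.
top_three = name_counts.most_common(3)
for name, n in top_three:
    print(name, n)
```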


Q6: Adapted from the solution to Q4; I search for # instead of @. However, inspection of the data reveals the # is sometimes actually used for its original purpose, to refer to numbers. 

The rules governing hashtags state that they must be alphanumeric and that they aren't case sensitive.

I account for #'s not associated with hashtags by checking that the remainder of the string after # is alphanumeric. I also force the strings into lower case since hashtags are not case sensitive, so #Halloween and #halloween are considered the same hashtag.

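Finally, the algorithm is meant to handle punctuation after a hashtag, such as #halloween!!!, in the same manner as before with usernames. As written, though, once anything has been stripped, testword is reassigned to word, which still includes the leading #, so the alphanumeric check fails and such hashtags are left out of the count.

This does not rule out the possibility of someone using # for an account number, like "Here's my account #54345234", but Twitter would likely interpret this as a hashtag anyway.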

In [15]:
nameCounts = {}
for leString in data[4]:
    splitString = leString.split()
    for word in splitString:
        if word[0] == '#' and len(word) > 1:
            testword = word[1:]  # the text after the '#'
            while not testword[-1].isalnum() and len(testword) > 1:
                word = word[:-1]
                # Note: testword now includes the leading '#', so the
                # isalnum() test below rejects any hashtag that had
                # trailing punctuation.
                testword = word
            if testword.isalnum():
                tag = word.lower()  # hashtags are not case sensitive
                nameCounts[tag] = nameCounts.get(tag, 0) + 1

print("There are",len(nameCounts),"unique Twitter hashtags in this file")

There are 64865 unique Twitter hashtags in this file
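
In the cell above, testword is reassigned to word (including the #) once any punctuation has been stripped, so the final isalnum() test rejects those hashtags. One way to keep the # out of the test is to strip punctuation from the part after the # only. This is a sketch with a made-up helper name, normalize_hashtag; run on the full file, it would likely give a different count from the one recorded above.

```python
def normalize_hashtag(word):
    """Return a lowercased '#tag' with trailing punctuation removed, or None."""
    if not word.startswith('#'):
        return None
    body = word[1:]
    # Strip trailing non-alphanumeric characters from the part after '#'.
    while body and not body[-1].isalnum():
        body = body[:-1]
    # Keep only tags whose remaining body is entirely alphanumeric.
    # An empty body fails this test, so a bare '#' is rejected.
    if body.isalnum():
        return '#' + body.lower()
    return None

for w in ["#Halloween!!!", "#halloween", "#54345234", "#", "#foo-bar", "plain"]:
    print(w, "->", normalize_hashtag(w))
```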


Q7: Adapted from the solutions to Q5 and Q6; the difference is, again, that hashtags are not case sensitive and must be alphanumeric. I also create a sorted list, and the last 10 entries are the most-mentioned hashtags. Trailing punctuation is handled exactly as in the Q6 cell.

In [9]:
from operator import itemgetter
nameCounts = {}
for leString in data[4]:
    splitString = leString.split()
    for word in splitString:
        if word[0] == '#' and len(word) > 1:
            # Same stripping logic as the Q6 cell
            testword = word[1:]
            while not testword[-1].isalnum() and len(testword) > 1:
                word = word[:-1]
                testword = word
            if testword.isalnum():
                tag = word.lower()
                nameCounts[tag] = nameCounts.get(tag, 0) + 1

# Sort (hashtag, count) pairs by count, ascending
sortD = sorted(nameCounts.items(),key=itemgetter(1))

print("The following list is the top 10 Twitter hashtags mentioned:")
for i in range(10):
    print(sortD[-i-1][0])

The following list is the top 10 Twitter hashtags mentioned:
#fail
#amazon
#ios11
#iphonex
#customerservice
#apple
#help
#aateam
#hppsdr
#iphone


Q8: Adapted from Q5 and Q7; I only count "words" made up entirely of alphanumeric characters, purposely leaving out hashtags and usernames, which start with # or @. I also force all strings into lower case to account for differences in case. Finally, I strip trailing punctuation using the previous method. Plurals and possessive nouns are treated as separate words. Note that in this cell the counting step sits inside the punctuation-stripping loop, so a word is tallied only when punctuation follows it. I found that accounting for punctuation actually changes this list quite a lot, which fits that behavior and is reasonable given that many customer support tweets probably end similarly.
"Glad we could help!" "Glad we could help." "glad we could help" etc.

In [14]:
from operator import itemgetter
nameCounts = {}
for leString in data[4]:
    splitString = leString.split()
    for word in splitString:
        while not word[-1].isalnum() and len(word) > 1:
            word = word[:-1]
            # Note: counting happens inside the stripping loop, so only
            # words that had trailing punctuation are ever counted.
            if word.lower() in nameCounts:
                nameCounts[word.lower()] += 1
            elif word.isalnum():
                nameCounts[word.lower()] = 1

# Sort (word, count) pairs by count, ascending
sortD = sorted(nameCounts.items(),key=itemgetter(1))

print("The following list is the top 20 words mentioned in these Tweets:")
for i in range(20):
    print(sortD[-i-1][0])

The following list is the top 20 words mentioned in these Tweets:
you
help
here
this
there
it
hi
thanks
us
dm
that
out
now
number
me
hello
issue
today
address
account
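
In the cell above, the counting lines are indented under the while loop, so a word is counted only after some punctuation has been stripped from it. A sketch of a variant that strips first and then counts every word follows. The helper name count_words and the sample tweets are made up, and on the full file this would likely produce a different top 20 from the list recorded above.

```python
from collections import Counter

def count_words(tweets):
    """Count lowercased alphanumeric words, ignoring trailing punctuation."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.split():
            # Strip trailing punctuation first...
            while len(word) > 1 and not word[-1].isalnum():
                word = word[:-1]
            # ...then count the word whether or not anything was stripped.
            # isalnum() also filters out tokens starting with '#' or '@'.
            if word.isalnum():
                counts[word.lower()] += 1
    return counts

sample = ["Glad we could help!", "Glad we could help.", "glad we could help", "@user #tag thanks"]
word_counts = count_words(sample)
print(word_counts.most_common(4))
```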


Some observations:

Dictionaries are much faster than trying to search lists for names. I went from a program that took about 8 hours to run (before I gave up and killed it) to a program that takes about 2.5 minutes (on my desktop). Point taken!
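
The speedup comes from how membership is checked: finding a name in a list of [name, count] pairs means scanning the list, while a dictionary lookup takes roughly constant time on average. A small sketch showing that both approaches produce the same counts (the function names and sample words are made up):

```python
def count_with_list(words):
    """Count words using a list of [word, count] pairs (linear scan per word)."""
    pairs = []
    for word in words:
        for pair in pairs:
            if pair[0] == word:
                pair[1] += 1
                break
        else:
            pairs.append([word, 1])  # word not seen before
    return pairs

def count_with_dict(words):
    """Count words using a dictionary (hash lookup per word)."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

words = ["@a", "@b", "@a", "@c", "@b", "@a"]
print(count_with_list(words))
print(count_with_dict(words))
```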

I suppose we shouldn't be surprised by the top 20 words.

I imagine the quality control steps you take can change the answers to some of these questions quite a bit. I suppose that's what distinguishes an experienced data scientist from an inexperienced one.