The following code chunk was run in R before I realized I should extract the data in Python first.

In [1]:
#Combine csv files

# all_place <- readLines("place_only.csv", skipNul = T)
# all_place <- c(all_place, readLines("place_only_2.csv", skipNul = T))
# all_place <- c(all_place, readLines("place_only_3.csv", skipNul = T))
# length(all_place) = 740165
# 
# all_text <- readLines("text_only.csv")
# all_text <- c(all_text, readLines("text_only_2.csv"))
# all_text <- c(all_text, readLines("text_only_3.csv"))
# length(all_text) = 1792511

#I had a feeling that the Python script might not capture all places and text in a way that lines up.
#This is why I captured the full tweet data!
#The only problem is that the full tweet data takes up 18 GB, and I can't figure out how to read it back in as tweets.
#Since it's very easy to split large files on Windows using Git Bash (https://gitforwindows.org/), I did so for each of my tweet files.
#This is the command that was used on each of the full_tweets_#.csv files: "split full_tweets_#.csv tweets# -l 10000 -d"
#This means that every file in each split except the last has 10000 lines.
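
The same line-based split could also be done in Python. This is a minimal sketch, not the command that was actually used: the helper name split_by_lines and the zero-padded numeric suffix are assumptions made for illustration, and the output names will not necessarily match what split -d produces.

```python
from itertools import islice


def split_by_lines(src_path, out_prefix, lines_per_file=10000, encoding="utf-8"):
    """Write src_path out as consecutive chunk files of at most lines_per_file lines.

    Returns the list of chunk paths in the order they were written.
    """
    chunk_paths = []
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, encoding=encoding, newline="") as src:
        index = 0
        while True:
            # Pull the next block of lines without loading the whole file.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            chunk_path = f"{out_prefix}{index:04d}"  # hypothetical naming scheme
            with open(chunk_path, "w", encoding=encoding, newline="") as out:
                out.writelines(chunk)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Only one chunk of lines is held in memory at a time, so the size of the source file does not matter.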

In [2]:
#import numpy for easy array accumulation
import numpy as np

In [5]:
#read in list of NFL teams
with open('teams.txt', encoding='utf-8-sig') as team_file:
    teams = [line.strip() for line in team_file]
teams

['Cardinals',
 'Falcons',
 'Ravens',
 'Bills',
 'Panthers',
 'Bears',
 'Bengals',
 'Browns',
 'Cowboys',
 'Broncos',
 'Lions',
 'Packers',
 'Texans',
 'Colts',
 'Jaguars',
 'Chiefs',
 'Chargers',
 'Rams',
 'Dolphins',
 'Vikings',
 'Patriots',
 'Saints',
 'Giants',
 'Jets',
 'Raiders',
 'Eagles',
 'Steelers',
 '49ers',
 'Seahawks',
 'Buccaneers',
 'Titans',
 'Redskins']

In [7]:
#Initialize array for team mention counts to be aggregated in
team_counts = np.zeros(len(teams), dtype=int)

In [8]:
#read in the list of split tweet files.
#I broke these up because I don't think my computer would know what to do when trying to load 18GB into RAM.
#Using utf-8-sig encoding because when I saved as a UTF-8 encoded file, it saved it as UTF-8-BOM for some reason.
with open("file_names.txt", encoding="utf-8-sig") as file_names_file:
    file_names = [line.strip() for line in file_names_file]

In [9]:
#This chews through each of the roughly 1/4 GB files until all 18 GB are processed.
#Counting mentions for each team is pretty quick - just a few seconds for each file once it's read in.
#Note: this is a case-sensitive substring match, and each line counts at most once per team.
for file_name in file_names:
    with open(file_name, encoding="utf-8-sig") as curr_tweets:
        tweets = curr_tweets.readlines()
    team_counts += np.array([sum(curr_team in curr_tweet for curr_tweet in tweets) for curr_team in teams])
print(team_counts)

[ 6000  8643 32814 13527  6889  9133 26183 29353 19491 14777 13067 13439
  7734  9305  4189 35295  9735 11517  9878 12614 69738 31115 19026  8459
  7645 38145 25622 43992 13040  4571 16711  7725]
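
The loop above reads each split file fully into memory with readlines() before counting. As a minimal sketch of the same counting rule applied one line at a time, the function below accepts any iterable of lines, including an open file. The name count_mentions and the sample lines are hypothetical, and it returns a plain Python list rather than a NumPy array.

```python
def count_mentions(lines, teams):
    """Count, for each team, how many lines contain the team name.

    Same rule as the notebook loop: case-sensitive substring match,
    with each line counted at most once per team.
    """
    counts = [0] * len(teams)
    for line in lines:  # works on any iterable, including an open file
        for i, team in enumerate(teams):
            if team in line:
                counts[i] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "Go Patriots!\n",
    "Patriots vs Jets tonight\n",
    "patriots in lowercase is not matched\n",
]
sample_counts = count_mentions(sample_lines, ["Patriots", "Jets", "Bears"])
print(sample_counts)  # [2, 1, 0]
```

Passing an open file object as lines would avoid holding a whole split file in memory, at the cost of a slower pure-Python inner loop.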


In [10]:
import pandas as pd

In [11]:
nfl_df = pd.DataFrame({'teams':teams, 'mentions':team_counts})
nfl_df.sort_values(by = ['mentions'])

         teams  mentions
14     Jaguars      4189
29  Buccaneers      4571
0    Cardinals      6000
4     Panthers      6889
24     Raiders      7645
31    Redskins      7725
12      Texans      7734
23        Jets      8459
1      Falcons      8643
5        Bears      9133

The order of team mention counts above makes sense to me, but I'm also realizing now that most of the statistical assumptions required for the chi-squared test for equal proportions are violated. These mentions on Twitter are certainly not independent, as most of them likely stem from specific events that occurred recently. The independence assumption is violated even more clearly once retweets are added into the mix.

I don't believe that the assumption of random sampling is satisfied either, as we are looking at only a snapshot of the season, and these mentions could be skewed by events that unfolded during just this past weekend of football. However, this can be addressed by limiting the scope of the analysis to the most popular team in the NFL at the present moment.
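
For reference, the statistic that a chi-squared test for equal proportions would compute can be sketched in a few lines. This only illustrates the arithmetic and does not make the test valid for this data, given the violated assumptions described above. The function name is hypothetical and the toy counts are made up for the example.

```python
def chi_squared_equal_proportions(counts):
    """Return (statistic, degrees_of_freedom) for H0: all categories are equally likely."""
    k = len(counts)
    total = sum(counts)
    expected = total / k  # same expected count in every category under H0
    statistic = sum((observed - expected) ** 2 / expected for observed in counts)
    return statistic, k - 1


# Toy counts, made up for illustration: the expected count is 20 per category.
stat, dof = chi_squared_equal_proportions([10, 20, 30])
print(stat, dof)  # 10.0 2
```

With the 32 team counts the degrees of freedom would be 31. With counts this large and this uneven the statistic would be very large, which is why the independence and sampling assumptions matter more than the number itself.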

In [12]:
nfl_df.to_csv("nfl_team_mentions.csv", index = False)