# Do Men and Women Sportscasters Talk about Sports Differently?
A while ago I was reading a few articles on women's sports and analysis thereof and ran across [this post][1]. I couldn't agree more that there is a serious lack of data on women's sports, and I thought I would try to address that to the extent that I could. As the article points out, there are some cool grassroots [efforts][2] and I'd like to contribute to these at some point, though moving to Kenya kind of put that on hold. In the meantime, I thought I could help address the lack of data on women's sports by looking at whether we talk about them differently. Having recently been reading [Unusual Efforts][3], I found their writing very interesting because so much of it focuses on how the writers relate to soccer (though there is certainly quality analysis as well). Now, clearly the point of the site is different from that of, say, [statsbomb][4] or [whoscored][5], but it made me think that maybe journalists covering sports talk about women's and men's sports differently, and that maybe women and men journalists use different verbiage.

I brought this idea to Adam and he agreed that it would be interesting to check out these potential differences - and he's pretty interested in text-mining. So this project will analyze the twitter feeds of various sports journalists. Adam and I aren't completely sure what type of analysis we'll do, but the goal is to make some cool visualizations and to see if there are any differences. Of course, this may be a pretty shallow analysis given that (1) we are focusing on a narrow subset of the population who are in the public eye, so their behavior may all tend towards some average, and (2) we are looking at twitter rather than at written pieces or actual commentary, so the character limit and the kind of language people use there may make it hard to draw out differences.

## This post is about getting the data
Well, for any analysis you need data, so this post is all about grabbing the twitter data using [tweepy][6]. There are a couple of different libraries you can use to grab the data. Adam and I just landed on tweepy because we saw a [good blog post][7] using it, and then we found a get_all_tweets() function built on top of it (credited further down), which was pretty great. Overall, grabbing the data was much easier than my NBA scraping experience.

[1]: https://howwegettonext.com/women-are-being-left-behind-by-the-sports-data-revolution-db87cdb65f57#.2ws69oxq4
[2]: https://wosostats.wordpress.com/
[3]: http://www.unusualefforts.com/
[4]: https://statsbomb.com
[5]: https://www.whoscored.com/
[6]: http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html
[7]: https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

In [104]:
# Necessary imports for getting all this data
import tweepy
from tweepy import OAuthHandler
import json
import sqlite3
import re
import pandas as pd
import csv

## SQL in Python
I've been taking some online courses through Coursera to learn more Python data science stuff and one of the projects was to write a program that mined a huge text file and to then populate a SQLite table. Quite a bit of the work in the course recently has focused on using relational databases in SQL to save on space and the integration between Python and SQL. Having not used SQL really at all prior to the course, seeing how clean SQL code is even compared to something like Stata is pretty striking. Anyway, I thought I'd try out writing this data into a SQL database just to do it myself, thus the code below. Adam prefers just to run the analysis through a csv/pandas df because of issues with his computer. Below is me just setting up the database and populating one table with the twitter handles we are going to use.

In [169]:
# Creating a sql table of twitter handles we are interested in
conn = sqlite3.connect('Dropbox/Python/jupyter-blog/content/Twitter_soccer/twitter_sports.sqlite')
conn.text_factory = str
cur = conn.cursor()

cur.executescript("""
DROP TABLE IF EXISTS User;
DROP TABLE IF EXISTS Tweet_sports;

CREATE TABLE User (
    id  INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    Name TEXT UNIQUE,
    Handle   TEXT UNIQUE,
    Sex TEXT
);
""")

<sqlite3.Cursor at 0x1263ac260>

In [170]:
cur.executescript('''
INSERT INTO User (Name, Handle, Sex) VALUES ("The Equalizer", "EqualizerSoccer", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Alexi Lalas", "AlexiLalas", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Eric Wynalda", "EricWynalda", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Taylor Twellman", "TaylorTwellman", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Kate Markgraf", "katemarkgraf", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Julie Foudy", "JulieFoudy", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Brad Friedel", "friedel_b", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Julie Stewart-Binks", "JSB_FOX", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Kyle Martino", "kylemartino", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Brandi Chastain", "brandichastain", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Aly Wagner", "alywagner", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Abby Wambach", "AbbyWambach", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Kate Abdo", "kate_abdo", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Gabriel Marcotti", "Marcotti", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Grant Wahl", "GrantWahl", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Maximilano Bretos", "mbretosESPN", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Rob Stone", "RobStoneONFOX", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Warren Barton", "warrenbarton2", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Arlo White", "arlowhite", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Alejandro Moreno", "AleMorenoESPN", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Robbie Mustoe", "robbiemustoe", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Steve Bower", "SteveBowercomm", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Robbie Earle & Robbie Mustoe", "The2RobbiesNBC", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Keith Costigan", "KeithCostigan", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Cat Whitehill", "catwhitehill4", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Bob Ley", "BobLeyESPN", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Tony DiCicco", "tonysocc", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Mark Jackson", "MarkJackson13", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Hannah Storm", "HannahStormESPN", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Doris Burke", "heydb", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Michael Smith", "michaelsmith", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Jemele Hill", "jemelehill", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Jalen Rose", "JalenRose", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Sage Steele", "sagesteele", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Michelle Beadle", "MichelleDBeadle", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Tracy McGrady", "Real_T_Mac", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("LaChina Robinson", "LaChinaRobinson", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Stephen Bardo", "stephenbardo", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Lindsay Czarniak", "lindsayczarniak", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Mark Jones", "MarkJonesESPN", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Holly Rowe", "sportsiren", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Heather Cox", "HeatherCoxNBC", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Marc Kestecher", "marckestecher", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Rebecca Lobo", "RebeccaLobo", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Gene Wojciechowski", "GenoEspn", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Dave Pasch", "DavePasch", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Scott van Pelt", "notthefakeSVP", "M");
INSERT INTO User (Name, Handle, Sex) VALUES ("Adriana Monsalve", "AdrianaMonsalve", "F");
INSERT INTO User (Name, Handle, Sex) VALUES ("Antonietta Collins", "AntoniettaESPN", "F");
''')

<sqlite3.Cursor at 0x1263ac260>
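As an aside, the same kind of insert can be written with a parameterized query and executemany, which avoids repeating the INSERT statement for every row. This is only a minimal sketch, assuming an in-memory database and a three-row sample of the handles above rather than the real twitter_sports.sqlite file.

```python
import sqlite3

# In-memory database used only for this illustration (an assumption,
# not the twitter_sports.sqlite file used in the rest of the post)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
CREATE TABLE User (
    id  INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    Name TEXT UNIQUE,
    Handle TEXT UNIQUE,
    Sex TEXT
)
""")

# A small sample of the (Name, Handle, Sex) rows from the script above
users = [
    ("Alexi Lalas", "AlexiLalas", "M"),
    ("Julie Foudy", "JulieFoudy", "F"),
    ("Grant Wahl", "GrantWahl", "M"),
]

# One parameterized statement, executed once per tuple in the list
cur.executemany("INSERT INTO User (Name, Handle, Sex) VALUES (?, ?, ?)", users)
conn.commit()

rows = cur.execute("SELECT Handle, Sex FROM User ORDER BY id").fetchall()
print(rows)
```

Keeping the rows in a Python list also means the same list could be reused later without querying the table again.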

## Creating a list to use later
From the database above, I can create a list of the twitter handles and the sex of the twitter user to use in the get_all_tweets function.

In [171]:
result = cur.execute('''SELECT Handle, Sex FROM User''')
a_list = result.fetchall()

## The call returns a list of tuples,
which is pretty nice because each tuple can be unpacked directly into a handle and a sex when looping over the list.

In [164]:
type(a_list[0])

tuple

In [178]:
len(a_list)

49

In [172]:
a_list

[('EqualizerSoccer', 'F'),
 ('AlexiLalas', 'M'),
 ('EricWynalda', 'M'),
 ('TaylorTwellman', 'M'),
 ('katemarkgraf', 'F'),
 ('JulieFoudy', 'F'),
 ('friedel_b', 'M'),
 ('JSB_FOX', 'F'),
 ('kylemartino', 'M'),
 ('brandichastain', 'F'),
 ('alywagner', 'F'),
 ('AbbyWambach', 'F'),
 ('kate_abdo', 'F'),
 ('Marcotti', 'M'),
 ('GrantWahl', 'M'),
 ('mbretosESPN', 'M'),
 ('RobStoneONFOX', 'M'),
 ('warrenbarton2', 'M'),
 ('arlowhite', 'M'),
 ('AleMorenoESPN', 'M'),
 ('robbiemustoe', 'M'),
 ('SteveBowercomm', 'M'),
 ('The2RobbiesNBC', 'M'),
 ('KeithCostigan', 'M'),
 ('catwhitehill4', 'F'),
 ('BobLeyESPN', 'M'),
 ('tonysocc', 'M'),
 ('MarkJackson13', 'M'),
 ('HannahStormESPN', 'F'),
 ('heydb', 'F'),
 ('michaelsmith', 'M'),
 ('jemelehill', 'F'),
 ('JalenRose', 'M'),
 ('sagesteele', 'F'),
 ('MichelleDBeadle', 'F'),
 ('Real_T_Mac', 'M'),
 ('LaChinaRobinson', 'F'),
 ('stephenbardo', 'M'),
 ('lindsayczarniak', 'F'),
 ('MarkJonesESPN', 'M'),
 ('sportsiren', 'F'),
 ('HeatherCoxNBC', 'F'),
 ('marckestecher'

## Sub-lists to deal with time out issues
While running the function to grab the tweets, I got timed out by a break in the connection to twitter, which is apparently common when you're making a lot of requests, even if you're observing twitter's rate limits. I left the error in the output further down to show it, but that's why I've created these sublists.

In [173]:
b_list = a_list[17:32]
c_list = a_list[32:]
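Another way to deal with dropped connections, instead of slicing the list by hand, would be to retry a failed call a few times before giving up. The sketch below is not what I actually ran: with_retries and flaky_fetch are hypothetical names, and the flaky function only simulates a timeout so the example runs without touching twitter. With tweepy you would presumably catch its own error class instead of IOError.

```python
import time


def with_retries(fetch, handle, max_attempts=3, wait_seconds=1.0,
                 retry_on=(IOError,), sleep=time.sleep):
    """Call fetch(handle), retrying when one of the retry_on errors is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(handle)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts, let the error surface
            sleep(wait_seconds * attempt)  # wait a little longer each time


# Hypothetical stand-in for a tweet-grabbing call: fails twice, then succeeds
calls = []


def flaky_fetch(handle):
    calls.append(handle)
    if len(calls) < 3:
        raise IOError("Read timed out.")
    return "%s_tweets.csv" % handle


# sleep is swapped out here so the demonstration does not actually pause
result = with_retries(flaky_fetch, "AlexiLalas", sleep=lambda seconds: None)
print(result)
```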

## Unneeded for some reason now?
Originally, when I pulled the twitter handles and sex from the SQL table, it returned a list of tuples in the form (u'handle', u'sex'). To deal with that, I converted each tuple to a string, used the short regular expression below to extract what I wanted, and then flipped it back to a tuple. At some point I ran the code and it just returned the tuples without the u'' prefixes. The u'' is only how Python 2 displays unicode strings, and the most likely reason it disappeared is that setting conn.text_factory = str above makes sqlite3 return plain byte strings instead.

In [None]:
names = []
for i in range(0, len(a_list)):
    # re.findall needs a string, so convert the tuple to its printed form first
    handle = re.findall(r"\(u'(.*)', u'(.*)'\)", str(a_list[i]))
    print(handle)
    names.append(handle)
names

In [127]:
cd

/Users/rorypulvino


## Thank you to [yanofsky][8]
Adam and I started out by trying to write our own scripts to grab the tweets we wanted. Initially we wanted to grab tweets from specific periods in time, such as during the women's world cup or the Olympics, when sportscasters were more likely to be tweeting about both men's and women's sports. Unfortunately, twitter doesn't let you do this. Because this is where we started, we didn't look very effectively for someone else's function at first. We had some initial success, and tweepy is pretty easy to use, but after about a day of playing around I decided it'd be better to search for someone who had likely done this in a cleaner fashion than Adam or I was ever likely to. I quickly found yanofsky's get_all_tweets function, downloaded the script and started modifying it for my needs. Below is just me loading it up and then running it on my own twitter handle to show how it works.

[8]: https://gist.github.com/yanofsky/5436496

In [153]:
# Getting tweet grab function
%run Dropbox/Python/jupyter-blog/content/tweet_dumper.py

In [118]:
get_all_tweets("rorypul", "M")

getting tweets before 539173446182526976
...10 tweets downloaded so far


## Altering the function
I altered the get_all_tweets() function in a number of ways, and I've posted the code to a GitHub Gist. Mainly, I altered it to grab favorite count, retweet count, hashtags and user mentions. Getting hashtags and user mentions was more difficult and involved writing a number of ugly try-except statements into the function, since the hashtags and user mentions are nested keys inside a list of dictionaries inside another dictionary. Because of this, when a tweet did not have a hashtag or a user mention, the function blew up with an IndexError since the list was empty, thus the try-except statements.
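To make the nesting concrete, here is a minimal sketch of pulling the first hashtag and the first user mention out of a tweet dictionary without a try-except. The first_entity helper and the two sample tweets are made up for illustration and are not the code in my modified function; the entities / hashtags / user_mentions layout follows the tweet JSON that twitter returns.

```python
def first_entity(tweet, kind, key):
    """Return the given key of the first entity of this kind, or None if there is none."""
    # entities is a dictionary; each kind ("hashtags", "user_mentions") is a list of dictionaries
    entities = tweet.get("entities", {}).get(kind, [])
    return entities[0][key] if entities else None


# Hypothetical, heavily trimmed tweet dictionaries
tweet_with = {
    "text": "RT @IEDPBrazil: When in Brazil, #GoBlue!",
    "entities": {
        "hashtags": [{"text": "GoBlue"}],
        "user_mentions": [{"screen_name": "IEDPBrazil"}],
    },
}
tweet_without = {
    "text": "Mobile technology for non-judicial grievance mechanisms",
    "entities": {"hashtags": [], "user_mentions": []},
}

hashtag = first_entity(tweet_with, "hashtags", "text")
mention = first_entity(tweet_with, "user_mentions", "screen_name")
missing = first_entity(tweet_without, "hashtags", "text")
print(hashtag, mention, missing)
```

Checking whether the list is empty before indexing does the same job as the try-except, just a bit more explicitly.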

In [119]:
# Read the csv written by get_all_tweets straight into a DataFrame
df = pd.read_csv('rorypul_tweets.csv')
df

Unnamed: 0,Twitter_Handle,Sex,Tweet_id,Created_at,Text,Favorite_count,Retweet_count,Hashtags,User_mentions
0,rorypul,M,730921414178439169,2016-05-13 00:43:39,"RT @poverty_action: New study in Science: ""Pos...",0,12,,poverty_action
1,rorypul,M,726062391134281728,2016-04-29 14:55:38,RT @landportal: Good Read: Reshaping the Debat...,0,14,landrights,landportal
2,rorypul,M,639474464388988929,2015-09-03 16:26:07,Mobile technology for non-judicial grievance m...,0,0,,
3,rorypul,M,628239016417185792,2015-08-03 16:20:28,The Land Battle: 15 Organizations Defending La...,0,0,,
4,rorypul,M,571787355134291969,2015-02-28 21:41:43,#removeterranceburk,0,1,removeterranceburk,
5,rorypul,M,571787241665781760,2015-02-28 21:41:16,"RT @IEDPBrazil: When in Brazil, #GoBlue! Thank...",0,6,GoBlue,IEDPBrazil
6,rorypul,M,571782596335677441,2015-02-28 21:22:48,RT @IEDPBrazil: A gente acaba de chegar em Bra...,0,2,,IEDPBrazil
7,rorypul,M,539889222996348928,2014-12-02 21:09:56,I just made a donation to U-M for #GivingBlued...,2,2,GivingBlueday,IEDPBrazil
8,rorypul,M,539782469382574081,2014-12-02 14:05:44,"IPSA should receive an extra $1,000 today for ...",1,1,GivingBlueday,UMichStudents
9,rorypul,M,539173446182526977,2014-11-30 21:45:41,@IEDPBrazil http://t.co/5QZNk7ra65 Brazil's ne...,0,1,,IEDPBrazil


## Error: 'ascii' codec can't encode character u'\xf9' in position 1
The first time I tried to run the loop below, I got this error, which seemed way over my head. Luckily, since I had been working on writing to the SQL table recently, I had been running across ascii and utf-8 errors, so I understood that this related to the encoding of the entries. Fixing it was not so easy though. At first I tried to just quickly encode the whole outtweets list in the get_all_tweets function as utf-8, but kept running into an AttributeError: 'list' object has no attribute 'encode'. This kind of infuriated me, since most of the answers I ran across on stackoverflow made it seem easy to just coerce a list to utf-8 using .encode('utf-8'). After spending 20 minutes pointlessly trying different variations of the encode call, I switched to using try-except statements to look at where the error was occurring, since it didn't occur for the first few twitter handles.
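The AttributeError makes sense in hindsight: .encode() is a method of individual strings, not of the list that holds them, so each text field has to be encoded on its own. Below is a small sketch of that idea, assuming a list of rows shaped roughly like outtweets; the sample rows and the encode_row helper are made up for illustration.

```python
# Hypothetical rows shaped roughly like outtweets: [handle, sex, tweet id, text]
rows = [
    [u"AlexiLalas", u"M", 123, u"Vamos \xf9"],
    [u"JulieFoudy", u"F", 456, u"Plain ascii text"],
]

# Calling encode on the list itself reproduces the AttributeError
try:
    rows.encode("utf-8")
except AttributeError as err:
    message = str(err)
    print(message)

text_type = type(u"")  # unicode in Python 2, str in Python 3


def encode_row(row):
    """Encode only the text fields of one row, leaving numbers untouched."""
    return [field.encode("utf-8") if isinstance(field, text_type) else field
            for field in row]


encoded = [encode_row(row) for row in rows]
print(encoded[0])
```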

In [155]:
df = pd.DataFrame()
for (name, sex) in a_list:
    get_all_tweets(name, sex)
    df = df.append(pd.read_csv('%s_tweets.csv' % name))

<type 'list'>
<type 'list'>
<type 'list'>
<type 'list'>
<type 'list'>
<type 'list'>
<type 'list'>
<type 'list'>


TweepError: Failed to send request: HTTPSConnectionPool(host='api.twitter.com', port=443): Read timed out.

In [167]:
for (name, sex) in b_list:
    get_all_tweets(name, sex)
    df = df.append(pd.read_csv('%s_tweets.csv' % name))

<type 'list'>
<type 'list'>


In [174]:
for (name, sex) in c_list:
    get_all_tweets(name, sex)
    df = df.append(pd.read_csv('%s_tweets.csv' % name))

<type 'list'>
<type 'list'>
<type 'list'>


In [175]:
df.head()

Unnamed: 0,Twitter_Handle,Sex,Tweet_id,Created_at,Text,Favorite_count,Retweet_count,Hashtags,User_mentions
0,EqualizerSoccer,F,795463504266481665,2016-11-07 03:10:53,.@GopherSoccer wins first Big 10 Tournament ti...,19,1,,GopherSoccer
1,EqualizerSoccer,F,795432981963874304,2016-11-07 01:09:35,The #NCAA tournament selection show is tomorro...,15,6,NCAA,
2,EqualizerSoccer,F,795427018062106624,2016-11-07 00:45:54,RT @GatorsSoccer: That's using your head!\n\nH...,0,23,,GatorsSoccer
3,EqualizerSoccer,F,795413198606389249,2016-11-06 23:50:59,.@GatorsSoccer defeated @RazorbackSoccer 2-1 i...,15,3,SECChampionship,GatorsSoccer
4,EqualizerSoccer,F,795410719965609988,2016-11-06 23:41:08,RT @GatorsSoccer: Here's a look at #Gators ope...,0,11,Gators,GatorsSoccer


In [176]:
df.tail()

Unnamed: 0,Twitter_Handle,Sex,Tweet_id,Created_at,Text,Favorite_count,Retweet_count,Hashtags,User_mentions
271,AntoniettaESPN,F,789070332191731712,2016-10-20 11:46:42,ICYMI: #Messi leads the #UCL w/ 6 goals this s...,2,1,Messi,Cristiano
272,AntoniettaESPN,F,789060136664985600,2016-10-20 11:06:11,"What we learned: Indians pitch way into WS, Cu...",2,2,,dschoenfield
273,AntoniettaESPN,F,789060071263199240,2016-10-20 11:05:55,RT @jaysonst: Dave Roberts says one more time ...,0,43,,jaysonst
274,AntoniettaESPN,F,788931024524873732,2016-10-20 02:33:08,I can only imagine what he is thinking. 😬 htt...,1,0,,
275,AntoniettaESPN,F,788912540139094016,2016-10-20 01:19:41,@RyanRosenblatt maybe at times switch it up wi...,0,0,,RyanRosenblatt


In [177]:
# Checking to make sure I got all the twitter handles into the df
df.Twitter_Handle.nunique()

49

## Got 'em all
The length of the list of unique twitter handles matches the length of my original list, a_list, so I know I got everyone's feed. 

In [179]:
cd

/Users/rorypulvino


In [180]:
cd Dropbox/Python/jupyter-blog/content/Twitter_soccer

/Users/rorypulvino/Dropbox (Personal)/Python/jupyter-blog/content/Twitter_soccer


In [181]:
df.to_csv('Tweets_sports.csv')

## Writing to a SQL table from a df is very easy
The to_sql command makes writing the df to a SQLite table very easy, and it saves on having to write out execute commands to insert values into the table and on having to commit constantly. The 'Created_at' column causes problems because of its type as well, so the cell below converts it with pd.to_datetime first.

In [182]:
df['Created_at'] = pd.to_datetime(df['Created_at'], errors='coerce')
df.to_sql(name='Tweet_sports', con=conn, if_exists='replace')
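For anyone who wants to try the same pattern without my data, here is a self-contained sketch using an in-memory database and two made-up rows (both are assumptions for illustration). The second row has a deliberately bad date to show that errors='coerce' turns it into a missing value instead of raising an error.

```python
import sqlite3

import pandas as pd

# In-memory database and made-up rows, used only for this illustration
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({
    "Twitter_Handle": ["rorypul", "rorypul"],
    "Created_at": ["2016-05-13 00:43:39", "not a date"],
    "Retweet_count": [12, 14],
})

# Unparseable dates become NaT rather than raising an error
df["Created_at"] = pd.to_datetime(df["Created_at"], errors="coerce")

# index=False keeps the DataFrame index out of the table
df.to_sql(name="Tweet_sports", con=conn, if_exists="replace", index=False)

n_rows = conn.execute("SELECT COUNT(*) FROM Tweet_sports").fetchone()[0]
n_missing = conn.execute(
    "SELECT COUNT(*) FROM Tweet_sports WHERE Created_at IS NULL"
).fetchone()[0]
print(n_rows, n_missing)
```

Using if_exists='replace' drops and recreates the table on each run, which is convenient while experimenting but would wipe out anything already stored there.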

## Initial attempts before discovering get_all_tweets()
When Adam and I started working on this project, we had different ideas of how to grab the tweets, so below are our first attempts at making our own functions to grab tweets using tweepy. The first is Adam's. He preferred to dump everything into a pandas DataFrame since something is wrong with his computer (I think that's the reason he gave). I thought it made more sense to put everything into a SQL db, in part because I wanted to experiment more with SQL in Python.

In [None]:
# Creates a df with a column for each dictionary key, but only gives the last 20 tweets:
# user_timeline returns a single page of 20 statuses unless count or a Cursor is used.
# (api is assumed to be an authenticated tweepy.API instance)
x = api.user_timeline('alexilalas')

df = pd.DataFrame()
for i in x:
   a = json.dumps(i._json)
   b = pd.read_json(a)
   df = df.append(b.ix[0] )

df

In [None]:
# Grab tweets and dump them into SQLite tables
# (assumes api is an authenticated tweepy.API instance and that Tweet_sports
# has Tweet and user_id columns)
handle = "AlexiLalas"

# Grabs the id number from the User table based on the twitter handle
cur.execute('SELECT id FROM User WHERE Handle = ? ', (handle, ))
user_id = cur.fetchone()[0]

for status in tweepy.Cursor(api.user_timeline, id=handle).items(1):
    # status._json is already a dictionary, so no dumps/loads round trip is needed
    tweet = status._json["text"]

    # Stores the tweet and the linked user id in the same row
    cur.execute('''INSERT INTO Tweet_sports (Tweet, user_id)
                    VALUES ( ?, ? )''', (tweet, user_id))
conn.commit()