# Data Wrangling

Merge and tidy the repos and users tables to prepare them for visualization.

In [1]:
import re

import pandas as pd

## Prepare Repo Data

Load the repos data and drop duplicates:

In [2]:
repos = pd.read_csv("data/repos-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', repos.shape)
repos = repos.drop_duplicates(subset='full_name', keep='last')
print('Shape after  dropping duplicates', repos.shape)
repos.head()

Shape before dropping duplicates (7060, 5)
Shape after  dropping duplicates (7059, 5)


Unnamed: 0,full_name,stars,forks,description,language
0,facebook/react-native,24783,4198,A framework for building native apps with React.,JavaScript
1,NARKOZ/hacker-scripts,19836,3553,Based on a true story,JavaScript
2,rackt/redux,11612,1180,Predictable state container for JavaScript apps,JavaScript
3,bevacqua/dragula,10737,593,:ok_hand: Drag and drop so simple it hurts,JavaScript
4,zenorocha/clipboard.js,10268,438,:scissors: Modern copy to clipboard. No Flash....,JavaScript
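
To see what `keep='last'` does before trusting it on the real dump, here is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not taken from `repos-dump.csv`). It first lists every row that shares a `full_name` with another row, then drops all but the final occurrence:

```python
import pandas as pd

# Hypothetical toy data: 'a/x' appears twice with different star counts.
toy = pd.DataFrame({
    'full_name': ['a/x', 'b/y', 'a/x'],
    'stars': [10, 20, 15],
})

# keep=False marks every member of a duplicate group, which is handy for inspection.
dupes = toy[toy.duplicated(subset='full_name', keep=False)]

# keep='last' retains only the final occurrence of each full_name.
deduped = toy.drop_duplicates(subset='full_name', keep='last')

print(dupes)
print(deduped)
```

Inspecting the duplicate rows first makes it easier to judge whether the last occurrence is really the one worth keeping.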


Separate out the `user` and `repo` from `full_name` into new columns:

In [3]:
# Split 'owner/repo' on the slash and keep each half in its own column.
name_parts = repos['full_name'].str.split('/')
repos['user'] = name_parts.str[0]
repos['repo'] = name_parts.str[1]
print(repos.shape)
repos.head()

(7059, 7)


Unnamed: 0,full_name,stars,forks,description,language,user,repo
0,facebook/react-native,24783,4198,A framework for building native apps with React.,JavaScript,facebook,react-native
1,NARKOZ/hacker-scripts,19836,3553,Based on a true story,JavaScript,NARKOZ,hacker-scripts
2,rackt/redux,11612,1180,Predictable state container for JavaScript apps,JavaScript,rackt,redux
3,bevacqua/dragula,10737,593,:ok_hand: Drag and drop so simple it hurts,JavaScript,bevacqua,dragula
4,zenorocha/clipboard.js,10268,438,:scissors: Modern copy to clipboard. No Flash....,JavaScript,zenorocha,clipboard.js
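
The split above assumes every `full_name` has the form `owner/repo`. One possible sanity check, sketched here on hypothetical values (including a deliberately malformed one), flags any name that does not match that shape:

```python
import pandas as pd

# Hypothetical sample; 'broken-name' is deliberately malformed.
names = pd.Series(['facebook/react-native', 'rackt/redux', 'broken-name'])

# A well-formed full_name is two non-empty parts separated by exactly one '/'.
well_formed = names.str.fullmatch(r'[^/]+/[^/]+')

# Anything that fails the pattern would need attention before splitting.
bad = names[~well_formed]
print(bad.tolist())
```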


## Prepare User Data

Load the users data and drop duplicates:

In [4]:
users = pd.read_csv("data/user-geocodes-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', users.shape)
users = users.drop_duplicates(subset='id', keep='last')
print('Shape after  dropping duplicates', users.shape)
users.head()

Shape before dropping duplicates (5103, 8)
Shape after  dropping duplicates (5103, 8)


Unnamed: 0,id,name,type,location,lat,long,city,country
0,KOWLOR,Malik Dellidj,User,"Lille, France",50.62925,3.057256,Lille,France
1,souporserious,Travis Arnold,User,"San Marcos, CA",33.143372,-117.166145,San Marcos,United States
2,pcqpcq,Joker,User,"Fuzhou, China",26.074508,119.296494,Fuzhou,China
3,ant4g0nist,Chaithu,User,,,,,
4,cs231n,,User,,,,,


Rename column `id` to `user`:

In [5]:
users = users.rename(columns={'id': 'user'})
users.head()

Unnamed: 0,user,name,type,location,lat,long,city,country
0,KOWLOR,Malik Dellidj,User,"Lille, France",50.62925,3.057256,Lille,France
1,souporserious,Travis Arnold,User,"San Marcos, CA",33.143372,-117.166145,San Marcos,United States
2,pcqpcq,Joker,User,"Fuzhou, China",26.074508,119.296494,Fuzhou,China
3,ant4g0nist,Chaithu,User,,,,,
4,cs231n,,User,,,,,


## Merge Repo and User Data

Left join repos and users:

In [6]:
repos_users = pd.merge(repos, users, on='user', how='left')
print('Shape repos:', repos.shape)
print('Shape users:', users.shape)
print('Shape repos_users:', repos_users.shape)
repos_users.head()

Shape repos: (7059, 7)
Shape users: (5103, 8)
Shape repos_users: (7059, 14)


Unnamed: 0,full_name,stars,forks,description,language,user,repo,name,type,location,lat,long,city,country
0,facebook/react-native,24783,4198,A framework for building native apps with React.,JavaScript,facebook,react-native,Facebook,User,"Menlo Park, California",37.45296,-122.181725,Menlo Park,United States
1,NARKOZ/hacker-scripts,19836,3553,Based on a true story,JavaScript,NARKOZ,hacker-scripts,Nihad Abbasov,User,"Katowice, Poland",50.264892,19.023782,Katowice,Poland
2,rackt/redux,11612,1180,Predictable state container for JavaScript apps,JavaScript,rackt,redux,,User,,,,,
3,bevacqua/dragula,10737,593,:ok_hand: Drag and drop so simple it hurts,JavaScript,bevacqua,dragula,Nicolás Bevacqua,User,https://twitter.com/nzgb,,,,
4,zenorocha/clipboard.js,10268,438,:scissors: Modern copy to clipboard. No Flash....,JavaScript,zenorocha,clipboard.js,Zeno Rocha,User,"Los Angeles, CA",34.052234,-118.243685,Los Angeles,United States
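
A left join keeps every repo row, so repos whose owner has no entry in the users table end up with empty user columns. As a hedged sketch on made-up frames (`toy_repos` and `toy_users` are illustrative only), `indicator=True` makes those unmatched rows easy to count, and `validate='many_to_one'` raises an error if the right-hand table has duplicate keys that would multiply rows:

```python
import pandas as pd

# Hypothetical toy frames: user 'b' has no row in toy_users.
toy_repos = pd.DataFrame({
    'full_name': ['a/x', 'a/y', 'b/z'],
    'user': ['a', 'a', 'b'],
})
toy_users = pd.DataFrame({
    'user': ['a'],
    'city': ['Lille'],
})

# indicator=True adds a '_merge' column; validate checks key uniqueness on the right.
merged = pd.merge(toy_repos, toy_users, on='user', how='left',
                  indicator=True, validate='many_to_one')

# Rows marked 'left_only' found no matching user.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

The row count of the merged frame equals that of the left frame, which is the same property the shapes printed above confirm for the real data.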


## Tidy Up Repo and User Data

Re-order the columns:

In [7]:
# Select the columns in display order.
repos_users = repos_users.reindex(columns=['full_name',
                                           'repo',
                                           'description',
                                           'stars',
                                           'forks',
                                           'language',
                                           'user',
                                           'name',
                                           'type',
                                           'location',
                                           'lat',
                                           'long',
                                           'city',
                                           'country'])
print(repos_users.shape)
repos_users.head()

(7059, 14)


Unnamed: 0,full_name,repo,description,stars,forks,language,user,name,type,location,lat,long,city,country
0,facebook/react-native,react-native,A framework for building native apps with React.,24783,4198,JavaScript,facebook,Facebook,User,"Menlo Park, California",37.45296,-122.181725,Menlo Park,United States
1,NARKOZ/hacker-scripts,hacker-scripts,Based on a true story,19836,3553,JavaScript,NARKOZ,Nihad Abbasov,User,"Katowice, Poland",50.264892,19.023782,Katowice,Poland
2,rackt/redux,redux,Predictable state container for JavaScript apps,11612,1180,JavaScript,rackt,,User,,,,,
3,bevacqua/dragula,dragula,:ok_hand: Drag and drop so simple it hurts,10737,593,JavaScript,bevacqua,Nicolás Bevacqua,User,https://twitter.com/nzgb,,,,
4,zenorocha/clipboard.js,clipboard.js,:scissors: Modern copy to clipboard. No Flash....,10268,438,JavaScript,zenorocha,Zeno Rocha,User,"Los Angeles, CA",34.052234,-118.243685,Los Angeles,United States
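
One thing to watch when reordering by label: `DataFrame.reindex` does not complain about a label that is missing from the frame; it adds an all-NaN column instead. The sketch below uses a hypothetical two-column frame and a deliberate typo to show the difference from plain list selection, which raises `KeyError`:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({'stars': [1, 2], 'forks': [3, 4]})

# reindex accepts the misspelled label and fills the new column with NaN.
reordered = toy.reindex(columns=['forks', 'stars', 'langauge'])  # deliberate typo
all_nan = reordered['langauge'].isna().all()

# Selecting with a list of labels fails loudly on the same typo.
try:
    toy[['forks', 'stars', 'langauge']]
    raised = False
except KeyError:
    raised = True

print(all_nan, raised)
```

Comparing the shape before and after the reorder, as the cell above does, is a cheap guard, though it only catches a dropped or extra column, not a misspelled one that replaces a real one.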


## Add Overall Ranks

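Rank each repo by its number of stars, with rank 1 going to the most-starred repo. `rank` uses its default `method='average'` here, so repos with the same star count share the mean of their positions and ranks can be fractional: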

In [8]:
repos_users['rank'] = repos_users['stars'].rank(ascending=False)
print(repos_users.shape)
repos_users.head()

(7059, 15)


Unnamed: 0,full_name,repo,description,stars,forks,language,user,name,type,location,lat,long,city,country,rank
0,facebook/react-native,react-native,A framework for building native apps with React.,24783,4198,JavaScript,facebook,Facebook,User,"Menlo Park, California",37.45296,-122.181725,Menlo Park,United States,2
1,NARKOZ/hacker-scripts,hacker-scripts,Based on a true story,19836,3553,JavaScript,NARKOZ,Nihad Abbasov,User,"Katowice, Poland",50.264892,19.023782,Katowice,Poland,4
2,rackt/redux,redux,Predictable state container for JavaScript apps,11612,1180,JavaScript,rackt,,User,,,,,,9
3,bevacqua/dragula,dragula,:ok_hand: Drag and drop so simple it hurts,10737,593,JavaScript,bevacqua,Nicolás Bevacqua,User,https://twitter.com/nzgb,,,,,10
4,zenorocha/clipboard.js,clipboard.js,:scissors: Modern copy to clipboard. No Flash....,10268,438,JavaScript,zenorocha,Zeno Rocha,User,"Los Angeles, CA",34.052234,-118.243685,Los Angeles,United States,11
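
Fractional ranks such as the 364.5 seen later in this notebook come from how `Series.rank` handles ties. A minimal sketch on a hypothetical star-count series compares the default `method='average'` with `method='min'`, which gives tied items the same whole-number rank:

```python
import pandas as pd

# Hypothetical star counts with one tie at 200.
stars = pd.Series([300, 200, 200, 100])

# Default method='average': tied items share the mean of their positions.
avg_rank = stars.rank(ascending=False)

# method='min': tied items all receive the lowest position in the group.
min_rank = stars.rank(ascending=False, method='min')

print(avg_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())  # [1.0, 2.0, 2.0, 4.0]
```

Which method suits a visualization is a design choice; the notebook itself uses the default.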


## Verify Results: Users

Equivalent [GitHub search query](https://github.com/search?utf8=%E2%9C%93&q=created%3A2015-01-01..2015-12-31+stars%3A%3E%3D100+user%3Adonnemartin&type=Repositories&ref=searchresults): `created:2015-01-01..2015-12-31 stars:>=100 user:donnemartin`

*Note: The numbers may differ slightly, because the live search reflects GitHub data as of the moment the query is run. The data in this notebook was mined on January 1, 2016 to 'freeze' the results for the year 2015. The more time that has passed since January 1, 2016, the larger the discrepancy.*

In [9]:
repos_users[repos_users['user'] == 'donnemartin']

Unnamed: 0,full_name,repo,description,stars,forks,language,user,name,type,location,lat,long,city,country,rank
2761,donnemartin/data-science-ipython-notebooks,data-science-ipython-notebooks,Continually updated data science Python notebo...,3945,623,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,80.0
2773,donnemartin/saws,saws,A supercharged AWS command line interface (CLI...,2591,88,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,176.0
2777,donnemartin/interactive-coding-challenges,interactive-coding-challenges,"Continually updated interactive, test-driven P...",2121,256,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,245.0
2784,donnemartin/awesome-aws,awesome-aws,A curated list of awesome Amazon Web Services ...,1631,96,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,353.0
2785,donnemartin/dev-setup,dev-setup,Mac OS X development environment setup: Easy-...,1581,197,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,364.5


## Verify Results: Python Repos

Equivalent [GitHub search query](https://github.com/search?utf8=%E2%9C%93&q=created%3A2015-01-01..2015-12-31+stars%3A%3E%3D100+language%3Apython&type=Repositories&ref=searchresults): `created:2015-01-01..2015-12-31 stars:>=100 language:python`

*Note: The numbers may differ slightly, because the live search reflects GitHub data as of the moment the query is run. The data in this notebook was mined on January 1, 2016 to 'freeze' the results for the year 2015. The more time that has passed since January 1, 2016, the larger the discrepancy.*

In [10]:
print(repos_users[repos_users['language'] == 'Python'].shape)
repos_users[repos_users['language'] == 'Python'].head()

(553, 15)


Unnamed: 0,full_name,repo,description,stars,forks,language,user,name,type,location,lat,long,city,country,rank
2758,nvbn/thefuck,thefuck,Magnificent app which corrects your previous c...,16449,768,Python,nvbn,Vladimir Iakovlev,User,"Russia, Saint-Petersburg",59.93428,30.335099,Saint Petersburg,Russia,6
2759,minimaxir/big-list-of-naughty-strings,big-list-of-naughty-strings,The Big List of Naughty Strings is a list of s...,9387,381,Python,minimaxir,Max Woolf,User,San Francisco Bay Area,37.827178,-122.291308,,United States,13
2760,XX-net/XX-Net,XX-Net,接力GoAgent翻墙工具----Anti-censor tools,4137,1535,Python,XX-net,XX-Net,User,,,,,,68
2761,donnemartin/data-science-ipython-notebooks,data-science-ipython-notebooks,Continually updated data science Python notebo...,3945,623,Python,donnemartin,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,80
2762,fchollet/keras,keras,"Deep Learning library for Python. Convnets, re...",3731,864,Python,fchollet,François Chollet,User,San Francisco,37.774929,-122.419415,San Francisco,United States,90


## Verify Results: Overall Repos

Equivalent [GitHub search query](https://github.com/search?utf8=%E2%9C%93&q=created%3A2015-01-01..2015-12-31+stars%3A%3E%3D100&type=Repositories&ref=searchresults): `created:2015-01-01..2015-12-31 stars:>=100`

*Note: The numbers may differ slightly, because the live search reflects GitHub data as of the moment the query is run. The data in this notebook was mined on January 1, 2016 to 'freeze' the results for the year 2015. The more time that has passed since January 1, 2016, the larger the discrepancy.*

In [11]:
print(repos_users.shape)
repos_users.head()

(7059, 15)


Unnamed: 0,full_name,repo,description,stars,forks,language,user,name,type,location,lat,long,city,country,rank
0,facebook/react-native,react-native,A framework for building native apps with React.,24783,4198,JavaScript,facebook,Facebook,User,"Menlo Park, California",37.45296,-122.181725,Menlo Park,United States,2
1,NARKOZ/hacker-scripts,hacker-scripts,Based on a true story,19836,3553,JavaScript,NARKOZ,Nihad Abbasov,User,"Katowice, Poland",50.264892,19.023782,Katowice,Poland,4
2,rackt/redux,redux,Predictable state container for JavaScript apps,11612,1180,JavaScript,rackt,,User,,,,,,9
3,bevacqua/dragula,dragula,:ok_hand: Drag and drop so simple it hurts,10737,593,JavaScript,bevacqua,Nicolás Bevacqua,User,https://twitter.com/nzgb,,,,,10
4,zenorocha/clipboard.js,clipboard.js,:scissors: Modern copy to clipboard. No Flash....,10268,438,JavaScript,zenorocha,Zeno Rocha,User,"Los Angeles, CA",34.052234,-118.243685,Los Angeles,United States,11


## Output Results

Write the results to CSV files for visualization in Tableau:

In [12]:
repos_users.to_csv('data/repos-users-geocodes.csv', index=False)
users.to_csv('data/users.csv', index=False)
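
A quick way to confirm that an export reads back cleanly is a round trip through an in-memory buffer. The sketch below uses a hypothetical frame and `io.StringIO` rather than the real output files. Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column:

```python
import io

import pandas as pd

# Hypothetical toy frame with a comma inside a field and a missing value.
toy = pd.DataFrame({
    'full_name': ['a/x', 'b/y'],
    'description': ['Drag, drop', None],
    'stars': [10, 20],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new frame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```

The field containing a comma is quoted on write and survives the round trip, while the missing description comes back as NaN.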