# Class 3 - Processing Files with Python
### DATSF 19
#### Justin Breucop - 12/7/2015

For many of the data files in this class, we'll use functionality from various libraries to process data very quickly. However, for custom files, raw text, and data laid out in a non-standard way, it is important to be able to extract data in a customized fashion. We'll go through this exercise using only the libraries that ship with the default Python distribution. The first step will be to open the file in Sublime Text.

Let's say that we are curious about the latest release of scikit-learn, since we are (or soon will be) frequent users. Our goal is to take the raw commits, sort the authors alphabetically, and count the number of contributions each one made. Let's first look at the file. You can do this from the command line, but for simplicity's sake we can use Jupyter's `!` shell escape.

In [26]:
# For Mac/Linux users:
! more ../data/raw_commits.txt

# For windows users:
# ! more ..\data\raw_commits.txt

[?1h=commit da4f480a6adf5fed30a42500fe0e5a21c404ac2a
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Thu Nov 5 14:57:45 2015 -0500

    Fix import of reload for python 3.3

commit 45ef71f2175fe305152e20b1a6095c535b575b84
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Thu Nov 5 14:31:45 2015 -0500

    MAINT version string for 0.17. D'OH

commit 37d18cef59a614661eb5afbadb9f8e1e124d685e
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Wed Nov 4 14:28:25 2015 -0500

    split installation into simple and advanced part

commit 9334274305e8b9ef0273835a8a6b53ed0c1810c0
Author: Andreas Mueller <t3kcit@gmail.com>
Date:   Thu Nov 5 10:36:17 2015 -0500

    skip unstable tests and doctests  on 32bit platform
[K[?1l>

We see that each commit has an `Author` line and a `Date` line. We need to read the file line by line and build a list of authors. Remember to use `with open('<filename>') as <variable>` where `<filename>` is the path to the file and `<variable>` is any identifier (such as `f`).
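As a minimal sketch of the `with open(...)` pattern, the block below writes two of the commits shown above to a temporary file and reads them back line by line. The temporary file and the names `sample`, `sample_path`, and `author_lines` are assumptions made for illustration, so the block runs without `raw_commits.txt`.

```python
import os
import tempfile

# Two commits copied from the listing above, in the same layout as the real file
sample = (
    "commit da4f480a6adf5fed30a42500fe0e5a21c404ac2a\n"
    "Author: Andreas Mueller <amueller@nyu.edu>\n"
    "Date:   Thu Nov 5 14:57:45 2015 -0500\n"
    "\n"
    "    Fix import of reload for python 3.3\n"
    "\n"
    "commit 9334274305e8b9ef0273835a8a6b53ed0c1810c0\n"
    "Author: Andreas Mueller <t3kcit@gmail.com>\n"
    "Date:   Thu Nov 5 10:36:17 2015 -0500\n"
    "\n"
    "    skip unstable tests and doctests  on 32bit platform\n"
)

# Hypothetical stand-in for ../data/raw_commits.txt
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_commits.txt')
with open(sample_path, 'w') as f:
    f.write(sample)

# The file is closed automatically when the with block ends
author_lines = []
with open(sample_path) as f:
    for line in f:
        if line.startswith('Author:'):
            author_lines.append(line.rstrip('\n'))

print(author_lines)
```

Iterating over the file object yields one line at a time, including the trailing newline, which is why `rstrip('\n')` is applied before storing each line.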

##### Lines of file -> List of Strings

In [1]:
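# Open the file and collect the author name from every line that starts with 'Author: '

list = []  # note: this name shadows Python's built-in list type

with open('../data/raw_commits.txt', 'r') as cf:
    for line in cf:
        # startswith() skips commit messages that merely mention 'Author'
        if line.startswith('Author: '):
            line = line.split(': ', 1)[1]  # drop the 'Author: ' label
            line = line.split(' <')[0]     # drop the ' <email>' part
            list.append(line)

        # Equivalent approach using slicing:
        #if line[:6] == 'Author':
            #list.append(line[8:].split(' <')[0])

print(list)

# Make sure to append the author name to the list. You'll need to use string manipulation techniques.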

['Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'KamalakerDadi', 'Andreas Mueller', 'Graham Clenaghan', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'trevorstephens', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'trevorstephens', 'Olivier Grisel', 'TomDLT', 'Lo\xc3\xafc Est\xc3\xa8ve', 'Andreas Mueller', 'Ganiev Ibraim', 'Giorgio Patrini', 'MartinBpr', 'Andreas Mueller', 'Arthur Mensch', 'Andreas Mueller', 'Jeffrey04', 'MaryanMorel', 'Arnaud Rachez', 'Olivier Grisel', 'giorgiop', 'Gilles Louppe', 'Andreas Mueller', 'MechCoder', 'Olivier Grisel', 'Olivier Grisel', 'Andreas Mueller', 'giorgiop', 'Olivier Grisel', 'Olivier Grisel', 'Joel Nothman', 'Alexandre Gramfort', 'Andreas Mueller', 'Olivi
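The string manipulation can be easier to follow on a single line of input. The sketch below walks through the two `split` calls one at a time; the helper name `extract_author` is hypothetical and not part of the exercise.

```python
def extract_author(line):
    """Return the author name from a line like 'Author: Name <email>'."""
    # Split once on the first ': ' and keep everything after the label
    after_label = line.split(': ', 1)[1]
    # Keep everything before the ' <' that opens the email address
    return after_label.split(' <')[0]


line = 'Author: Andreas Mueller <amueller@nyu.edu>\n'
name = extract_author(line)
print(name)
```

Passing `1` as the second argument to `split` limits it to a single split, so a name that happened to contain `': '` would not be cut short.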

Sort the authors to find the first and last authors alphabetically. Make sure your data is clean! (No username should begin with an `=` sign, for example.)

##### List of Strings -> Sorted unique list

In [41]:
# Think of what data types you can take advantage of
clean_list = []
for author in list:
    # Skip malformed entries: any name containing an '=' sign
    if '=' not in author:
        clean_list.append(author)

# Duplicates remain here: one entry per commit
sorted_list = sorted(clean_list)

print(sorted_list)

['Aaron Schumacher', 'Adithya Ganesh', 'Adithya Ganesh', 'Adithya Ganesh', 'Adithya Ganesh', 'Adithya Ganesh', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandre Gramfort', 'Alexandr
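The sorted list above still has one entry per commit. If only the distinct names are needed, one option is to pass the list through a `set` before sorting. The sketch below uses a handful of names taken from the earlier output; `sample_authors` is an illustrative stand-in for `clean_list`.

```python
# A few names from the output above, with a repeat
sample_authors = ['Olivier Grisel', 'Andreas Mueller', 'TomDLT',
                  'Andreas Mueller', 'giorgiop']

# set() drops duplicates, sorted() returns a new ordered list
unique_sorted = sorted(set(sample_authors))
print(unique_sorted[0])    # first author alphabetically
print(unique_sorted[-1])   # last author alphabetically

# The default sort is case-sensitive (uppercase sorts before lowercase);
# key=str.lower compares names without regard to case
case_insensitive = sorted(set(sample_authors), key=str.lower)
print(case_insensitive)
```

With the default ordering, lowercase usernames such as `giorgiop` land after every capitalized name, which affects who counts as the "last" author.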

To count the commits, we can loop over our list and build a dictionary where each key is a commit author and the value increases by one every time that author appears.

##### List -> Dictionary

In [46]:
author_count = {}
for name in sorted_list:
    if name in author_count:
        author_count[name] += 1
    else:
        author_count[name] = 1

print(author_count)

{'Lars': 13, 'sinhrks': 2, 'Rohan Ramanath': 1, 'Gryllos Prokopis': 3, 'Steven Seguin': 4, 'Lilian Besson': 3, 'Hsuan-Tien Lin': 8, 'Thomas Unterthiner': 26, 'Daniel Kronovet': 3, 'Dan Blanchard': 1, 'Andrew Lamb': 5, 'Dougal Sutherland': 1, 'Alexey Grigorev': 1, 'maheshakya': 3, 'Skipper Seabold': 2, 'Yucheng Low': 1, 'Vincent': 1, 'tokoroten': 5, 'Ando Saabas': 1, 'Alexandre Gramfort': 60, 'Vighnesh Birodkar': 7, 'Clyde-fare': 1, 'Boyuan Deng': 3, 'scls19fr': 2, 'Peter Fischer': 1, 'Ganiev Ibraim': 5, 'zhai_pro': 2, 'Kyler Brown': 1, 'Christopher Erick Moody': 1, 'MaryanMorel': 1, 'Tian Wang': 1, 'Stephen Hoover': 3, 'Joshua Loyal': 3, 'Jaidev Deshpande': 1, 'Cindy Sridharan': 3, 'Dougal J. Sutherland': 1, 'Allen Riddell': 1, 'Ari Rouvinen': 1, 'Zac Stewart': 1, 'John Wittenauer': 2, 'Eric Martin': 1, 'Matti Lyra': 2, 'Donne Martin': 1, 'Martin Ku': 2, 'Frank Zalkow': 9, 'edson duarte': 1, 'Jacob Schreiber': 16, 'Joel Nothman': 37, 'mbillinger': 4, 'Manoj Kumar': 5, 'Barmaley.exe': 4
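The standard library also offers `collections.Counter`, which builds the same kind of author-to-count mapping in one call. This is an alternative to the loop above, shown on a small made-up list; `sample_authors` and `sample_counts` are illustrative names.

```python
from collections import Counter

# Small illustrative list, one entry per commit
sample_authors = ['Andreas Mueller', 'Olivier Grisel', 'Andreas Mueller',
                  'giorgiop', 'Olivier Grisel', 'Andreas Mueller']

# Counter is a dict subclass mapping each item to its number of occurrences
sample_counts = Counter(sample_authors)

print(sample_counts['Andreas Mueller'])
print(sample_counts.most_common(1))  # list of (author, count) pairs, highest first
```

Writing the loop by hand is still worthwhile for the exercise, since it shows exactly what `Counter` does behind the scenes.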

Find the contributor with the highest number of commits. Useful dictionary method: `dict.get()`

##### Dictionary -> Specific String

In [74]:
most_commits = 0
most_commits_author = ''
for author in author_count:
    if author_count[author] > most_commits:
        most_commits = author_count[author]
        most_commits_author = author
print(most_commits_author)
print(most_commits)


Andreas Mueller
235
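The `dict.get()` hint points to a shorter route: `max()` accepts a `key` function, and passing the dictionary's own `get` method makes it compare authors by their counts. The counts below are made up for illustration and are not taken from the commit file.

```python
# Hypothetical counts, for illustration only
sample_counts = {'Andreas Mueller': 3, 'Olivier Grisel': 2, 'giorgiop': 1}

# Iterating over a dict yields its keys; key=sample_counts.get ranks each
# key by its value, so max() returns the author with the highest count
top_author = max(sample_counts, key=sample_counts.get)
top_count = sample_counts[top_author]

print(top_author)
print(top_count)
```

Like the loop above, this returns only one author when several share the highest count.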


Bonus: how do you handle a tie? Can you pull all authors with the lowest number of commits (without hardcoding the minimum)?

In [89]:
least_commits_list = {}        # despite the name, a dict of author -> count
least_commits = most_commits   # start from the known maximum
for key in author_count:
    if author_count[key] < least_commits:
        # New minimum found: discard the previous candidates
        least_commits = author_count[key]
        least_commits_list = {key: author_count[key]}
    elif author_count[key] == least_commits:
        # Tie with the current minimum: keep this author as well
        least_commits_list[key] = author_count[key]

print(least_commits_list)


{'Christof Angermueller': 1, 'Nicolas': 1, 'Rohan Ramanath': 1, 'Tiago Freitas Pereira': 1, 'Yucheng Low': 1, 'John Kirkham': 1, 'Eric Larson': 1, 'Danny Sullivan': 1, 'Chih-Wei Chang': 1, 'Timothy Hopper': 1, 'Tian Wang': 1, 'Jaidev Deshpande': 1, 'Robert Layton': 1, 'Dan Blanchard': 1, 'MartinBpr': 1, 'Dougal Sutherland': 1, 'Alexey Grigorev': 1, 'Sam Zhang': 1, 'Varoquaux': 1, 'Tom Dupr\xc3\xa9 la Tour': 1, 'Jeremy': 1, 'Preston Parry': 1, 'Ari Rouvinen': 1, 'banilo': 1, 'Ando Saabas': 1, 'Giorgio Patrini': 1, 'Valentin Stolbunov': 1, 'Vincent Michel': 1, 'Clyde-fare': 1, 'Tom DLT': 1, 'Jake Vanderplas': 1, 'benjaminirving': 1, 'Jiali Mei': 1, 'Frank C. Eckert': 1, 'Peter Fischer': 1, 'Fernando Carrillo': 1, 'Kyler Brown': 1, 'Shivan Sornarajah': 1, 'Arnaud Rachez': 1, 'Ali Baharev': 1, 'Konstantin Shmelkov': 1, 'akitty': 1, 'Nikolay Mayorov': 1, 'Christopher Erick Moody': 1, 'Jeffrey04': 1, 'saurabh.bansod': 1, 'Yury Zhauniarovich': 1, 'Erich Schubert': 1, 'MaryanMorel': 1, 'JeanKo
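Another way to handle ties, sketched here under the assumption that two passes over the dictionary are acceptable: find the minimum value first, then collect every author whose count equals it. The counts are made up for illustration.

```python
# Hypothetical counts, for illustration only
sample_counts = {'Andreas Mueller': 3, 'Olivier Grisel': 2,
                 'giorgiop': 1, 'TomDLT': 1}

# First pass: the smallest count, computed rather than hardcoded
fewest = min(sample_counts.values())

# Second pass: every author tied at that count, sorted for stable output
least_active = sorted(name for name, count in sample_counts.items()
                      if count == fewest)

print(fewest)
print(least_active)
```

The same two-step idea works for ties at the top by swapping `min` for `max`.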