# Python Basics

## Data Science Tutorial

Now that we've covered some Python basics, we will begin a tutorial that walks through many of the tasks a data scientist may perform. We will obtain real-world data and go through the process of auditing, analyzing, visualizing, and building classifiers from the data.

We will use a database of selected disease statistics for various countries from the Global Health Observatory. The data is organized by country and year, with the number of reported cases of each disease listed. The attributes and domain of each entry are described in the table below:

| Attribute                     | Domain                          |
|-------------------------------|---------------------------------|
| 1. Country                    | String                          |
| 2. Year                       | Year (2009-2014)                |
| 3. T.b. gambiense             | Integer                         |
| 4. T.b. rhodesiense           | Integer                         |
| 5. Cholera                    | Integer                         |
| 6. Meningitis (suspected)     | Integer                         |
| 7. Congenital Rubella         | Integer                         |
| 8. Diphtheria                 | Integer                         |
| 9. Japanese encephalitis      | Integer                         |
| 10. Leprosy                   | Integer                         |
| 11. Malaria                   | Integer                         |
| 12. Measles                   | Integer                         |
| 13. Mumps                     | Integer                         |
| 14. Neonatal Tetanus          | Integer                         |
| 15. Pertussis                 | Integer                         |
| 16. Plague                    | Integer                         |
| 17. Poliomyelitis             | Integer                         |
| 18. Rubella                   | Integer                         |
| 19. Total Tetanus             | Integer                         |
| 20. Tuberculosis              | Integer                         |
| 21. Yellow Fever              | Integer                         |
| 22. Cutaneous Leishmaniasis   | Integer                         |
| 23. Visceral Leishmaniasis    | Integer                         |

For more information on this data set:
http://apps.who.int/gho/data/node.home

## Obtaining the Data
Let's begin by programmatically obtaining the data. Here I'll define a function we can use to make HTTP requests and download the data.

In [38]:
import requests

def download_file(url, local_filename):
    # stream=True downloads the response in chunks instead of loading the entire file into memory
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop on an HTTP error instead of saving an error page to disk
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

Now we'll specify the URL of the file and the file name we will save to.

In [39]:
url = 'https://raw.githubusercontent.com/dsiufl/Python-Workshops/master/GlobalHealthData.csv'
filename = 'GlobalHealthData.csv'

And make a call to <code>download_file</code>

In [40]:
download_file(url, filename)

**Note:**  If you see an InsecurePlatformWarning message, ignore it. More info can be found here: https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning

Now this might seem like overkill for downloading a single, small csv file, but we can use this same function to access countless APIs available on the World Wide Web by building an API request in the url.
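As a rough sketch of what "building an API request in the URL" can look like, the snippet below appends query parameters to a base address. The endpoint, the parameter names, and the `build_api_url` helper are all hypothetical and chosen only for illustration; a real API documents its own URL scheme.

```python
from urllib.parse import urlencode

def build_api_url(base_url, **params):
    """Append query parameters to base_url (hypothetical helper, not part of requests)."""
    return base_url + "?" + urlencode(params)

# hypothetical endpoint and parameters, for illustration only
api_url = build_api_url("https://api.example.com/v1/cases", country="AFG", year=2013)
print(api_url)  # https://api.example.com/v1/cases?country=AFG&year=2013
```

The resulting string could then be passed to <code>download_file</code> in place of the CSV address.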

## Wrangling the Data
Now that we have some data, let's get it into a useful form. For this task we will use a package called pandas. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python. The most fundamental data structure in pandas is the dataframe, which is similar to the data.frame data structure found in the R statistical programming language.

For more information: http://pandas.pydata.org

A pandas dataframe is a 2-dimensional labeled data structure with columns of potentially different types. Dataframes can be thought of as similar to a spreadsheet or SQL table.
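To make "columns of potentially different types" concrete, here is a minimal sketch that builds a tiny dataframe from a dictionary. The column names and values are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# toy values invented for illustration, not taken from the data set
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],    # text column
    "Year": [2013, 2014, 2013],    # integer column
    "Cases": [12.5, 7.0, 3.0],     # floating point column
})

print(toy.dtypes)  # one data type per column
print(toy.shape)   # (3, 3): three rows, three columns
```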

There are numerous ways to build a dataframe with pandas. Since we have already obtained a csv file, we can use a parser built into pandas called <code>read_csv</code>, which will read the contents of a csv file directly into a dataframe.

For more information: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html

In [41]:
import pandas as pd # import the module and alias it as pd

health_data = pd.read_csv('GlobalHealthData.csv')
health_data.head() # show the first few rows of the data

Unnamed: 0,Country,Year,Number of new reported cases of human African trypanosomiasis (T.b. gambiense),Number of new reported cases of human African trypanosomiasis (T.b. rhodesiense),Number of reported cases of cholera,Number of suspected meningitis cases reported,Congenital Rubella Syndrome - number of reported cases,Diphtheria - number of reported cases,Japanese encephalitis - number of reported cases,Leprosy - number of reported cases,...,Neonatal tetanus - number of reported cases,Pertussis - number of reported cases,Plague - number of reported cases,Poliomyelitis - number of reported cases,Rubella - number of reported cases,Total tetanus - number of reported cases,Tuberculosis - new and relapse cases,Yellow fever - number of reported cases,Number of cases of cutaneous leishmaniasis reported,Number of cases of visceral leishmaniasis reported
0,Afghanistan,2014,,,,,0.0,0.0,,,...,17,0,,28.0,43.0,39,,,,
1,Afghanistan,2013,,,3957.0,,0.0,0.0,0.0,39.0,...,13,371,,17.0,367.0,24,30507.0,0.0,23621.0,16.0
2,Afghanistan,2012,,,12.0,,,0.0,,37.0,...,21,1497,,37.0,,37,28679.0,,33894.0,24.0
3,Afghanistan,2011,,,3733.0,,,0.0,,50.0,...,20,0,,80.0,750.0,20,27983.0,,31293.0,21.0
4,Afghanistan,2010,,,2369.0,,0.0,0.0,0.0,51.0,...,23,0,,25.0,46.0,23,28029.0,0.0,32145.0,11.0


Let's take a look at some simple statistics for the **Cholera** column.

In [42]:
health_data["Number of reported cases of cholera"].describe()

count       245.000000
mean       6134.971429
std       26976.057407
min           1.000000
25%                NaN
50%                NaN
75%                NaN
max      340311.000000
Name: Number of reported cases of cholera, dtype: float64

According to the documentation, the data contains 1164 entries. However, the "count" row shows only 245. This is because many fields in the original data are empty; pandas reads each empty field as NumPy's <code>nan</code> ("Not a Number") value, and <code>describe</code> leaves those values out of the count.
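The gap between the number of rows and the reported count can be reproduced on a toy example. The CSV text below is invented for illustration; its empty fields stand in for the missing counts in the real file.

```python
import io
import pandas as pd

# toy CSV invented for illustration; two of the four Cholera fields are empty
csv_text = "Country,Year,Cholera\nA,2013,12\nA,2014,\nB,2013,\nB,2014,7\n"
toy = pd.read_csv(io.StringIO(csv_text))

total_rows = len(toy)                    # every row, missing or not
reported = toy["Cholera"].count()        # count() skips nan
missing = toy["Cholera"].isna().sum()    # number of nan entries

print(total_rows, reported, missing)     # 4 2 2
print(toy["Cholera"].dtype)              # float64, because nan is a float
```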

Let's take a look at another column, this time **Pertussis**.

In [43]:
health_data["Pertussis - number of reported cases"].describe()

count     972
unique    374
top         0
freq      333
Name: Pertussis - number of reported cases, dtype: object

Well at least the name is correct.  We were expecting a mean and standard deviation, and now the data type is an object.  

What's up with our data?

We have arrived at arguably the most important part of performing data science: dealing with messy data. One of the most important tools in a data scientist's toolbox is the ability to audit, clean, and reshape data. The real world is full of messy data, and your sources may not always have data in the exact format you desire.

In this case we are working with csv data, which is a relatively straightforward format, but this will not always be the case when performing real-world data science. Data comes in all varieties, from csv all the way to something as unstructured as a collection of emails or documents. A data scientist must be versed in a wide variety of technologies and methodologies in order to be successful.

Now, let's do a little bit of digging into why we are not getting a numeric pandas column.

In [44]:
health_data["Pertussis - number of reported cases"].unique()

array(['0', '371', '1497', nan, '6', '16', '4', '10', '69', '104', '1',
       '3', '1259', '1554', '2539', '1127', '561', '1112', '1239', '1594',
       '804', '1743', '85', '30', '8', '11', '11842', '12319', '23855',
       '38040', '34285', '29545', '571', '309', '414', '183', '18', '27',
       '15', '5', '12', '13', '44', '17', '378', '188', '576', '151',
       '112', '93', '1501', '1142', '548', '103', '133', '53', '75', '31',
       '19', '25', '7687', '5211', '5400', '2257', '477', '1037', '52',
       '89', '102', '46', '54', '251', '7', '68', '372', '513', '1527',
       '1261', '4845', '676', '759', '1667', '241', '124', '100', '63',
       '1964', '5762', '2582', '794', '692', '3408', '1712', '2183',
       '2517', '1764', '1612', '734', '13682', '3289', '1010', '344',
       '407', '21', '137', '58', '130', '79', '71', '664', '108', '61',
       '42', '101', '9', '2', '2521', '1233', '738', '324', '662', '956',
       '3108', '3407', '2452', '2157', '830', '855', '484', '

Using <code>unique</code> we can see that '0 0', '5 5', and '2 2' all appear as distinct values in this series. Because of the space between the numbers, pandas could not parse these entries as numbers, so it stored the whole column as *strings* rather than *integers*. Indeed, it's not immediately obvious that these were meant to be legitimate entries in the first place.
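Scanning a long <code>unique</code> listing by eye is error-prone. One possible way to pull out only the offending entries is sketched below on a toy series (the values are invented for illustration): anything that was present in the original but turns into <code>nan</code> under numeric conversion is, by definition, not parseable as a number.

```python
import pandas as pd

# toy series invented for illustration; None plays the role of an empty field
s = pd.Series(["0", "371", "0 0", None, "5 5", "12"])

# errors="coerce" turns anything unparseable into nan instead of raising
converted = pd.to_numeric(s, errors="coerce")

# keep entries that were present originally but failed to convert
bad = s[converted.isna() & s.notna()]
print(bad.tolist())  # ['0 0', '5 5']
```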

Let's see what we can do with these unrecognized values.

In [48]:
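# errors='coerce' turns unparseable entries such as '0 0' into nan; the default would raise a ValueError
health_data["Pertussis - number of reported cases"] = pd.to_numeric(health_data["Pertussis - number of reported cases"], errors='coerce')

Here we have converted the **Pertussis** series to a numeric type, asking pandas to replace any entry it cannot parse with <code>nan</code>. Let's see what the unique values are now.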

In [49]:
health_data["Pertussis - number of reported cases"].unique()

array([  0.00000000e+00,   3.71000000e+02,   1.49700000e+03,
                    nan,   6.00000000e+00,   1.60000000e+01,
         4.00000000e+00,   1.00000000e+01,   6.90000000e+01,
         1.04000000e+02,   1.00000000e+00,   3.00000000e+00,
         1.25900000e+03,   1.55400000e+03,   2.53900000e+03,
         1.12700000e+03,   5.61000000e+02,   1.11200000e+03,
         1.23900000e+03,   1.59400000e+03,   8.04000000e+02,
         1.74300000e+03,   8.50000000e+01,   3.00000000e+01,
         8.00000000e+00,   1.10000000e+01,   1.18420000e+04,
         1.23190000e+04,   2.38550000e+04,   3.80400000e+04,
         3.42850000e+04,   2.95450000e+04,   5.71000000e+02,
         3.09000000e+02,   4.14000000e+02,   1.83000000e+02,
         1.80000000e+01,   2.70000000e+01,   1.50000000e+01,
         5.00000000e+00,   1.20000000e+01,   1.30000000e+01,
         4.40000000e+01,   1.70000000e+01,   3.78000000e+02,
         1.88000000e+02,   5.76000000e+02,   1.51000000e+02,
         1.12000000e+02,

The decimal point after each number shows that these integer counts are now stored as floating point numbers. Now, instead of our pesky *strings*, we have <code>nan</code> (not a number). <code>nan</code> is the value pandas uses to represent the absence of a value. It is a special floating point value exposed by the numpy package, which pandas uses internally; because a regular integer column cannot hold <code>nan</code>, the whole series is stored as floats.
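A few quick checks, shown here as a small sketch on invented values, illustrate these properties of <code>nan</code>:

```python
import numpy as np
import pandas as pd

print(type(np.nan))      # <class 'float'>: nan is an ordinary Python float
print(np.nan == np.nan)  # False: nan never compares equal, even to itself

# whole-number values plus one nan: the series is stored as floats
s = pd.Series([1, 2, np.nan])
print(s.dtype)           # float64

# use isna() rather than == to detect missing entries
n_missing = int(s.isna().sum())
print(n_missing)         # 1
```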

Now that we have <code>nan</code> values in place of strings, we can use some nice features in pandas to deal with these missing values.

What we are about to do is called "imputing": providing a replacement for missing values so the data set becomes easier to work with. There are a number of strategies for imputing missing values, all with their own pitfalls. In general, imputation introduces some degree of bias to the data, so the imputation strategy should be chosen to minimize that bias.

Here, we will simply ignore all of the <code>nan</code> values; however, other strategies, such as replacing each <code>nan</code> with the mean of the data, are also commonly used.

In [50]:
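# convert_objects was removed in pandas 1.0; coerce every column except Country instead
numeric_cols = health_data.columns.drop("Country")
health_data[numeric_cols] = health_data[numeric_cols].apply(pd.to_numeric, errors="coerce")
health_data["Pertussis - number of reported cases"].unique()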

array([  0.00000000e+00,   3.71000000e+02,   1.49700000e+03,
                    nan,   6.00000000e+00,   1.60000000e+01,
         4.00000000e+00,   1.00000000e+01,   6.90000000e+01,
         1.04000000e+02,   1.00000000e+00,   3.00000000e+00,
         1.25900000e+03,   1.55400000e+03,   2.53900000e+03,
         1.12700000e+03,   5.61000000e+02,   1.11200000e+03,
         1.23900000e+03,   1.59400000e+03,   8.04000000e+02,
         1.74300000e+03,   8.50000000e+01,   3.00000000e+01,
         8.00000000e+00,   1.10000000e+01,   1.18420000e+04,
         1.23190000e+04,   2.38550000e+04,   3.80400000e+04,
         3.42850000e+04,   2.95450000e+04,   5.71000000e+02,
         3.09000000e+02,   4.14000000e+02,   1.83000000e+02,
         1.80000000e+01,   2.70000000e+01,   1.50000000e+01,
         5.00000000e+00,   1.20000000e+01,   1.30000000e+01,
         4.40000000e+01,   1.70000000e+01,   3.78000000e+02,
         1.88000000e+02,   5.76000000e+02,   1.51000000e+02,
         1.12000000e+02,

<code>health_data.mean(numeric_only=True).round()</code> takes the mean of each numeric column (this computation ignores the <code>nan</code> values currently present), rounds it, and returns a Series indexed by the columns of the original dataframe.

These column means could be used to replace all missing values in each column. In this tutorial, however, we will not use this method, because filling so many missing values with the column mean would artificially shrink our standard deviations.

In [51]:
health_data.mean(numeric_only=True).round()

Year                                                                                  2012.0
Number of new reported cases of human African trypanosomiasis (T.b. gambiense)         370.0
Number of new reported cases of human African trypanosomiasis (T.b. rhodesiense)        21.0
Number of reported cases of cholera                                                   6135.0
Number of suspected meningitis cases reported                                         2175.0
Congenital Rubella Syndrome - number of reported cases                                   1.0
Diphtheria - number of reported cases                                                   33.0
Japanese encephalitis - number of reported cases                                        48.0
Leprosy - number of reported cases                                                    1949.0
Malaria - number of reported confirmed cases                                        316480.0
Measles - number of reported cases                                    
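The effect of mean imputation on the standard deviation can be checked on a toy series. This is only a sketch with invented values, comparing the two strategies discussed above: ignoring <code>nan</code> versus filling it with the mean.

```python
import numpy as np
import pandas as pd

# toy series invented for illustration: three observed values, two missing
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# strategy 1: ignore nan (pandas skips nan by default)
ignored_mean = s.mean()
ignored_std = s.std()

# strategy 2: replace each nan with the mean of the observed values
filled = s.fillna(s.mean())
filled_mean = filled.mean()
filled_std = filled.std()

print(ignored_mean, filled_mean)  # the mean is unchanged by mean imputation
print(ignored_std, filled_std)    # the standard deviation drops after filling
```

The filled-in values sit exactly at the mean, so they add observations without adding spread, which is why the standard deviation shrinks.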

Now that we have figured out how to handle these missing values, let's start over and quickly apply this technique to the entire dataframe.

In [52]:
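health_data = pd.read_csv('GlobalHealthData.csv')
# convert_objects was removed in pandas 1.0; coerce every column except Country instead
numeric_cols = health_data.columns.drop("Country")
health_data[numeric_cols] = health_data[numeric_cols].apply(pd.to_numeric, errors="coerce")
health_data["Tuberculosis - new and relapse cases"].describe()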

count    9.480000e+02
mean     3.055099e+04
std      1.227259e+05
min      0.000000e+00
25%               NaN
50%               NaN
75%               NaN
max      1.351913e+06
Name: Tuberculosis - new and relapse cases, dtype: float64

In [21]:
health_data["Tuberculosis - new and relapse cases"].unique()

array([             nan,   3.05070000e+04,   2.86790000e+04,
         2.79830000e+04,   2.80290000e+04,   2.61500000e+04,
         4.74000000e+02,   4.08000000e+02,   4.22000000e+02,
         4.31000000e+02,   4.45000000e+02,   2.07010000e+04,
         2.18800000e+04,   2.14290000e+04,   2.23360000e+04,
         2.17010000e+04,   5.00000000e+00,   9.00000000e+00,
         3.00000000e+00,   7.00000000e+00,   8.00000000e+00,
         5.86070000e+04,   5.18190000e+04,   4.72400000e+04,
         4.46550000e+04,   4.12210000e+04,   1.00000000e+01,
         6.00000000e+00,   8.93300000e+03,   8.75800000e+03,
         9.73300000e+03,   7.33600000e+03,   7.70100000e+03,
         1.39700000e+03,   1.21300000e+03,   1.26100000e+03,
         1.41000000e+03,   1.56000000e+03,   1.25000000e+03,
         1.30500000e+03,   1.23900000e+03,   1.25700000e+03,
         1.29400000e+03,   6.24000000e+02,   6.20000000e+02,
         6.73000000e+02,   6.62000000e+02,   6.63000000e+02,
         5.86000000e+03,

Structurally, a pandas dataframe is a collection of Series objects sharing a common index. In general, the Series object and the DataFrame object share a large number of functions, with some behavioral differences. In other words, whatever computation you can do on a single column can generally be applied to the entire dataframe.
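A small sketch on invented values shows the same method applied to a single column and to a whole dataframe. The main behavioral difference here is the return type: a scalar for the Series, a Series of per-column results for the dataframe.

```python
import pandas as pd

# toy dataframe invented for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

col_mean = df["a"].mean()  # Series.mean returns a single number
all_means = df.mean()      # DataFrame.mean returns one mean per column

print(col_mean)            # 2.0
print(all_means["b"])      # 20.0
print(type(all_means).__name__)  # Series
```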

Now we can use the dataframe version of <code>describe</code> to get an overview of all of our data.

In [22]:
health_data.describe()

Unnamed: 0,Year,Number of new reported cases of human African trypanosomiasis (T.b. gambiense),Number of new reported cases of human African trypanosomiasis (T.b. rhodesiense),Number of reported cases of cholera,Number of suspected meningitis cases reported,Congenital Rubella Syndrome - number of reported cases,Diphtheria - number of reported cases,Japanese encephalitis - number of reported cases,Leprosy - number of reported cases,Malaria - number of reported confirmed cases,...,Neonatal tetanus - number of reported cases,Pertussis - number of reported cases,Plague - number of reported cases,Poliomyelitis - number of reported cases,Rubella - number of reported cases,Total tetanus - number of reported cases,Tuberculosis - new and relapse cases,Yellow fever - number of reported cases,Number of cases of cutaneous leishmaniasis reported,Number of cases of visceral leishmaniasis reported
count,1164.0,109.0,36.0,245.0,96.0,749.0,953.0,487.0,588.0,503.0,...,963.0,962.0,33.0,970.0,1010.0,1008.0,948.0,789.0,244.0,246.0
mean,2011.5,369.844037,21.444444,6134.971429,2175.25,1.311081,32.56978,47.926078,1949.170068,316480.3,...,27.439252,1095.180873,27.545455,3.189691,535.10099,71.886905,30550.99,5.012674,4099.094262,901.495935
std,1.708559,1291.601046,32.5747,26976.057407,6133.867934,9.39767,322.362516,287.843712,12400.256473,729097.4,...,124.04224,4768.084078,74.925835,26.790189,3970.447112,303.368764,122725.9,45.072388,9779.236568,3670.883278
min,2009.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2010.0,,,,,,,,,,...,,,,,,,,,,
50%,2011.5,,,,,,,,,,...,,,,,,,,,,
75%,2013.0,,,,,,,,,,...,,,,,,,,,,
max,2014.0,7183.0,129.0,340311.0,56128.0,189.0,6094.0,3913.0,134752.0,6715223.0,...,1412.0,60385.0,313.0,460.0,69860.0,5017.0,1351913.0,1024.0,71996.0,33155.0
