# Tutorial 1A Getting started 

<br>

![](http://i2.wp.com/www.fantasticmaps.com/wp-content/uploads/2013/03/World_Of_Ice_And_Fire.jpg)


--------------------

# The Jupyter Notebook

The Jupyter notebook is a valuable tool for data science. The ability to run code, document it, and add diagrams with Markdown makes it an 'all-in-one' solution for doing analyses and communicating your findings.

We are going to run through a few of the features of the notebook and Markdown. This is not a comprehensive discussion of the features, but hopefully it will give you enough of a taste that you will consider it worthwhile to research further.

-------------

# 1. Some Basic Notebook Shortcut Keys

There are two 'modes' in the Jupyter Notebook. 

- Hit the 'esc' key for **command mode**; the cell border will turn blue
- Hit the 'return' key for **edit mode**; the cell border will turn green

Command mode lets us talk to the Notebook; edit mode is where we write code in the cell.

Finally:
- Hit 'shift' + 'return' to run the code. The two keys are side by side on the right of the keyboard.

## In command mode (blue border)

Key|Function
--|--
b|create cell below
a|create cell above
x|cut cell
v|paste cell
m|change to markdown cell
y|change to code cell
l|line numbers

## Most importantly

If you hit the 'esc' key to go into command mode and hit 'h', you will bring up a help page with all the shortcuts.

# 2. Markdown

Markdown is how you write text, create links, and show images and videos.
Markdown is, ironically, a mark-up language. You might know other mark-up languages such as HTML, where a small number of character combinations are used to indicate the formatting of the text.

Markdown's set of formatting characters is quite unobtrusive: it takes little effort to write and doesn't get in the way of reading the text, even in its raw (unformatted) state.

## Headings

The '#' character creates headings, with '#' being the biggest and '######' the smallest.

## Bullet points 

The '-' character creates a bullet point:

- Bread
- Milk
- Butter
- Cheese

### Numbered lists

Replace the '-' with numbers (the numbers don't have to be in order) to get a numbered list:

1. Learn Data Science
2. Do Analysis
3. Profit!

### Links and images

Creating links is done with the `[]()` combination. Place the URL in the round brackets and the link text in the square brackets. 

[Markdown Cheatsheet](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf) 

Images use the same combination with an exclamation mark in front: `![]()`

![](http://slatestarcodex.com/blog_images/hit_angel.png)

There are many other tricks with markdown in the Jupyter Notebook as well as Notebook Magics. Look online at the Jupyter Notebook documentation. 

# 3. Pandas basics

First you need to import the pandas library. The name comes from 'Panel Data', not the bamboo-eating bear.

In [1]:
import pandas as pd  

Let's import the UFO sightings dataset

In [2]:
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head() 

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


Learning how to properly index columns and rows will speed up your coding process.

See the [pandas indexing documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html).

In [3]:
ufo['City'].head()

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

In [4]:
ufo.City.head()

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object
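
Both forms return the same Series here. The sketch below uses a made-up three-row frame (not the UFO data; the `count` column is hypothetical) to show two cases where only the bracket form is safe: a column name containing a space, and a column name that collides with a DataFrame method.

```python
import pandas as pd

# Toy data, made up for illustration (not the UFO dataset)
toy = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Abilene"],
    "Shape Reported": ["TRIANGLE", "OVAL", "DISK"],
    "count": [1, 2, 3],  # hypothetical column whose name matches a method
})

# Dot and bracket notation agree for a simple column name
same = toy.City.equals(toy["City"])

# A name with a space cannot be typed as an attribute, so use brackets
shapes = toy["Shape Reported"]

# 'count' is also a DataFrame method: the attribute gives the method,
# while brackets give the column
dot_is_method = callable(toy.count)
count_column = toy["count"]

print(same, dot_is_method)
```

For that reason the bracket form is the safer habit, even though the dot form is quicker to type.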

In [5]:
ufo[['City','State']].head()

Unnamed: 0,City,State
0,Ithaca,NY
1,Willingboro,NJ
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY


In [6]:
ufo.loc[:,'City':'State'].head()

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY


In [7]:
ufo.iloc[1:6,2:6]

Unnamed: 0,Shape Reported,State,Time
1,OTHER,NJ,6/30/1930 20:00
2,OVAL,CO,2/15/1931 14:00
3,DISK,KS,6/1/1931 13:00
4,LIGHT,NY,4/18/1933 19:00
5,DISK,ND,9/15/1934 15:30
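
Note the difference between the last two cells: a `.loc` slice works on labels and includes the end label, while an `.iloc` slice works on positions and excludes the end position, like a normal Python slice. A minimal sketch on a made-up four-row frame (not the UFO data):

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", "DISK"],
    "State": ["NY", "NJ", "CO", "KS"],
})

# Label slice: both 'City' and 'State' are included
by_label = toy.loc[:, "City":"State"]

# Position slice: rows 1 and 2, columns 0 and 1 (end positions excluded)
by_position = toy.iloc[1:3, 0:2]

print(by_label.columns.tolist())
print(by_position)
```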


In [8]:
ufo.City=='Ithaca'

0         True
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
18211    False
18212    False
18213    False
18214    False
18215    False
18216    False
18217    False
18218    False
18219    False
18220    False
18221    False
18222    False
18223    False
18224    False
18225    False
18226    False
18227    False
18228    False
18229    False
18230    False
18231    False
18232    False
18233    False
18234    False
18235    False
18236    False
18237    False
18238    False
18239    False
18240    False
Name: City, Length: 18241, dtype: bool

In [9]:
ufo.City[ufo.City=='Ithaca']

0        Ithaca
4068     Ithaca
5631     Ithaca
6961     Ithaca
7573     Ithaca
9088     Ithaca
16537    Ithaca
17049    Ithaca
Name: City, dtype: object

In [10]:
ufo[['City','State']][ufo.City=='Ithaca']


Unnamed: 0,City,State
0,Ithaca,NY
4068,Ithaca,NY
5631,Ithaca,MI
6961,Ithaca,NY
7573,Ithaca,NY
9088,Ithaca,NY
16537,Ithaca,MI
17049,Ithaca,NY
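
The same selection can be written in a single step with `.loc`, passing the boolean mask for the rows and a list of names for the columns. This is a sketch on a made-up frame, not the UFO data; it also shows how two conditions could be combined.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Ithaca", "Abilene"],
    "State": ["NY", "CO", "MI", "KS"],
    "Shape Reported": ["TRIANGLE", "OVAL", "DISK", "DISK"],
})

# Rows chosen by a boolean mask, columns chosen by name, in one call
is_ithaca = toy["City"] == "Ithaca"
ithaca_rows = toy.loc[is_ithaca, ["City", "State"]]

# Combine conditions with & (and) or | (or); wrap each one in parentheses
in_ny = toy["State"] == "NY"
ithaca_ny = toy.loc[is_ithaca & in_ny, ["City", "State"]]

print(ithaca_rows)
print(ithaca_ny)
```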



## Now try importing and indexing yourself 

Can you import this dataset and find out which country has the highest beer consumption per person?

http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/SA_0000001400&profile=crosstable&filter=COUNTRY:*;YEAR:2012&x-sideaxis=COUNTRY;DATASOURCE;ALCOHOLTYPE&x-topaxis=GHO;YEAR

The dataset is from WHO:
http://apps.who.int/gho/data/node.main.A1026?lang=en

In [11]:
Alcohol = pd.read_csv('http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/SA_0000001400&profile=crosstable&filter=COUNTRY:*;YEAR:2012&x-sideaxis=COUNTRY;DATASOURCE;ALCOHOLTYPE&x-topaxis=GHO;YEAR',skiprows=1)
Alcohol.head(10)

Unnamed: 0,Country,Data Source,Beverage Types,2012
0,Afghanistan,Data source,All types,0.01
1,Afghanistan,Data source,Beer,0.01
2,Afghanistan,Data source,Wine,0.0
3,Afghanistan,Data source,Spirits,0.0
4,Afghanistan,Data source,Other alcoholic beverages,0.0
5,Albania,Data source,All types,5.14
6,Albania,Data source,Beer,1.68
7,Albania,Data source,Wine,1.33
8,Albania,Data source,Spirits,2.05
9,Albania,Data source,Other alcoholic beverages,0.09


In [12]:
Alcohol[Alcohol['Beverage Types']=="Beer"].head()

Unnamed: 0,Country,Data Source,Beverage Types,2012


Why didn't it work?

There is leading white space in some of the column names and values!

In [13]:
Alcohol[Alcohol['Beverage Types']==" Beer"].sort_values([" 2012"])

Unnamed: 0,Country,Data Source,Beverage Types,2012
423,Pakistan,Data source,Beer,0.00
498,Saudi Arabia,Data source,Beer,0.00
63,Bangladesh,Data source,Beer,0.00
1,Afghanistan,Data source,Beer,0.01
258,Haiti,Data source,Beer,0.01
628,Yemen,Data source,Beer,0.04
313,Jordan,Data source,Beer,0.05
283,Indonesia,Data source,Beer,0.06
168,Democratic People's Republic of Korea,Data source,Beer,0.09
193,Egypt,Data source,Beer,0.13
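
An alternative to matching the padded strings is to strip the white space once and then work with clean names. The sketch below assumes a tiny made-up CSV that mimics the padding in the WHO file. The country names and numbers are invented for illustration. Sorting in descending order puts the highest consumer first, which is what the exercise asks for.

```python
import io
import pandas as pd

# Made-up rows that mimic the padding in the WHO file (values are invented)
raw = io.StringIO(
    "Country, Beverage Types, 2012\n"
    "Country A, Beer,1.5\n"
    "Country A, Wine,0.7\n"
    "Country B, Beer,3.2\n"
    "Country B, Wine,0.1\n"
)
drinks = pd.read_csv(raw)

# Strip the padding from the column names and from the text column
drinks.columns = drinks.columns.str.strip()
drinks["Beverage Types"] = drinks["Beverage Types"].str.strip()

# Filter on the clean value, then sort so the largest value comes first
beer = drinks[drinks["Beverage Types"] == "Beer"]
top = beer.sort_values("2012", ascending=False).head(1)

print(top)
```

`pd.read_csv` also accepts `skipinitialspace=True`, which may remove this kind of padding at load time.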


# 4. The Pandas way of working


First, some rows have missing data (look at Colors Reported below), so I'll drop every row that contains a missing value to simplify what we're doing.

In [14]:
ufo[10:20]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10,Fontana,,LIGHT,CA,8/15/1937 21:00
11,Waterloo,,FIREBALL,AL,6/1/1939 20:00
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
13,Keokuk,,OVAL,IA,7/7/1939 2:00
14,Ludington,,DISK,MI,6/1/1941 13:00
15,Forest Home,,CIRCLE,CA,7/2/1941 11:30
16,Los Angeles,,,CA,2/25/1942 0:00
17,Hapeville,,,GA,6/1/1942 22:30
18,Oneida,,RECTANGLE,TN,7/15/1942 1:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00


In [15]:
ufo.dropna(axis = 0,inplace = True,how = 'any')
ufo[10:20]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
152,Irving,BLUE,DISK,KS,4/15/1951 0:30
157,Greenville,GREEN,DISK,MS,6/15/1951 20:30
163,Green River,GREEN,SPHERE,WY,7/3/1951 12:00
164,Provo,BLUE,DISK,UT,7/10/1951 23:30
174,Greenville,ORANGE,TRIANGLE,TX,4/15/1952 16:00
178,Norfolk,RED,FIREBALL,VA,6/1/1952 22:00
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
226,Cambridge,RED,SPHERE,MA,4/1/1953 18:00
229,Midwest City,YELLOW,FIREBALL,OK,4/15/1953 16:00
238,Cleveland,RED,FIREBALL,OH,6/30/1953 0:00


In [16]:
ufo.reset_index(inplace= True)
ufo[10:20]

Unnamed: 0,index,City,Colors Reported,Shape Reported,State,Time
10,152,Irving,BLUE,DISK,KS,4/15/1951 0:30
11,157,Greenville,GREEN,DISK,MS,6/15/1951 20:30
12,163,Green River,GREEN,SPHERE,WY,7/3/1951 12:00
13,164,Provo,BLUE,DISK,UT,7/10/1951 23:30
14,174,Greenville,ORANGE,TRIANGLE,TX,4/15/1952 16:00
15,178,Norfolk,RED,FIREBALL,VA,6/1/1952 22:00
16,202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
17,226,Cambridge,RED,SPHERE,MA,4/1/1953 18:00
18,229,Midwest City,YELLOW,FIREBALL,OK,4/15/1953 16:00
19,238,Cleveland,RED,FIREBALL,OH,6/30/1953 0:00
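
Notice that `reset_index` kept the old row labels in a new 'index' column. The sketch below uses a made-up frame (not the UFO data) to show the same two steps without `inplace`, plus `drop=True`, which discards the old labels instead.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values, made up for illustration
toy = pd.DataFrame({
    "City": ["Fontana", "Belton", None, "Bering Sea"],
    "Colors Reported": [np.nan, "RED", "BLUE", "RED"],
    "State": ["CA", "SC", "GA", "AK"],
})

# Keep only rows with no missing values; this returns a new frame
complete = toy.dropna(how="any")

# drop=True throws the old labels away instead of adding an 'index' column
renumbered = complete.reset_index(drop=True)

print(complete.index.tolist())
print(renumbered)
```

The cells that follow in this tutorial select columns by position and rely on the extra 'index' column being there, so they would need different positions if `drop=True` were used.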


### Creating a new column

I want to create a column with the combined City and State place names. I'll start by creating a column called 'place' with an empty string in every row. This isn't strictly necessary when using proper Pandas methods, but it makes the demonstration more straightforward.

In [17]:
ufo['place'] = ''
ufo.head()

Unnamed: 0,index,City,Colors Reported,Shape Reported,State,Time,place
0,12,Belton,RED,SPHERE,SC,6/30/1939 20:00,
1,19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00,
2,36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30,
3,44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00,
4,82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00,


## Timing it
The Notebook magic `%%timeit` runs the cell many times (it chooses the number of loops automatically) and reports the mean and standard deviation of the time per loop over several runs.

In [18]:
%%timeit

# Using proper Pandas whole series operations
ufo['place'] = ufo['City'] + ', ' + ufo['State']

790 µs ± 84.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [19]:
ufo.head()

Unnamed: 0,index,City,Colors Reported,Shape Reported,State,Time,place
0,12,Belton,RED,SPHERE,SC,6/30/1939 20:00,"Belton, SC"
1,19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00,"Bering Sea, AK"
2,36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30,"Portsmouth, VA"
3,44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00,"Blairsden, CA"
4,82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00,"San Jose, CA"


Now reset the column and do it again, this time with a loop.

In [20]:
ufo['place'] = ''
ufo.head()

Unnamed: 0,index,City,Colors Reported,Shape Reported,State,Time,place
0,12,Belton,RED,SPHERE,SC,6/30/1939 20:00,
1,19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00,
2,36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30,
3,44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00,
4,82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00,


In [21]:
ufo.index

RangeIndex(start=0, stop=2486, step=1)

In [22]:
ufo.iloc[0,0]

12

In [23]:
%%timeit

# Using a for loop to create each entry in turn
for i in ufo.index:
    ufo.iloc[i,6] = ufo.iloc[i,1] + ', ' + ufo.iloc[i,4]

989 ms ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
ufo.head()

Unnamed: 0,index,City,Colors Reported,Shape Reported,State,Time,place
0,12,Belton,RED,SPHERE,SC,6/30/1939 20:00,"Belton, SC"
1,19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00,"Bering Sea, AK"
2,36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30,"Portsmouth, VA"
3,44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00,"Blairsden, CA"
4,82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00,"San Jose, CA"



### Now you see the difference: about 0.0008 s vs 0.99 s! (The running time may vary on different computers.)

Pandas is built on NumPy arrays, so do everything you can to avoid iterating over rows.
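
Outside a notebook, the same comparison can be sketched with the standard-library `timeit` module. The frame below is made up, and the helper names `vectorised` and `row_by_row` are hypothetical. The timings will differ from the ones above and from machine to machine, so only the relative difference matters.

```python
import timeit
import pandas as pd

# Toy data, made up for illustration (600 rows)
toy = pd.DataFrame({
    "City": ["Belton", "Bering Sea", "Portsmouth"] * 200,
    "State": ["SC", "AK", "VA"] * 200,
})

def vectorised(df):
    # One whole-column operation
    return df["City"] + ", " + df["State"]

def row_by_row(df):
    # Build each entry in turn, one row at a time
    places = []
    for i in range(len(df)):
        places.append(df.iloc[i, 0] + ", " + df.iloc[i, 1])
    return pd.Series(places, index=df.index)

# Total time for 5 calls of each version
fast = timeit.timeit(lambda: vectorised(toy), number=5)
slow = timeit.timeit(lambda: row_by_row(toy), number=5)
print(f"vectorised: {fast:.4f} s, loop: {slow:.4f} s")
```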

# Readable Jupyter notebook with Python code

### Readability counts!!!

As a data scientist you are communicating your insights to others. The reader may not be primarily a programmer. They may work in another language and they may not be a machine learning practitioner. This is why readability is important.

We use Python and the Jupyter notebook because they enhance **readability** and your ability to clearly explain your insights. It is important to **put yourself** in the position of **your reader**. You should explain your process so they don't have to spend too long working out what you have done and why.

Interestingly, there are a few good programming concepts that conflict with this readability.

One is the practice of defining lots of variables and functions at the start of the script. This makes sense from a code development point of view, because there is one place to change a value that might be used throughout the script. From the reader's point of view, however, it means they have to remember the value of that variable until it is used, maybe dozens of lines further down. Define variables in the same section as the code that uses them, and choose meaningful variable names to help the reader follow your process. Also, do not keep many unneeded dataframes around; they confuse your reader and can slow your code down too.

Another point to consider carefully: do not try to amaze your reader with complicated solutions. Think twice before writing functions or even classes to perform actions. Simple is better than complex, and complex is better than complicated. Sometimes complex solutions can't be avoided because the problem is complex, but consider whether a step-by-step procedural approach might be clearer for the reader than an object-oriented approach. Again, if you do define functions, defining them just before you use them will be easier on your reader than defining all functions at the start or the end of the script and making the reader remember how each function operates.

On the same topic, try to use the built-in functions of the major libraries rather than writing your own. It is difficult for an individual to write code that is as optimised and versatile as code from the group that developed the library. Many open source projects have hundreds of contributors and reviewers. Remember, "Many eyes make all bugs shallow". Take advantage of the work the community has done writing documentation and keeping a consistent interface; scikit-learn is a great example of this. Your reader may also be familiar with the built-in functions of the library you are using, and will therefore understand what you are doing with less effort than if you create your own way to do the same task.
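
As a small illustration of preferring a library function, the sketch below counts values in a made-up Series twice: once with a hand-written loop and once with the pandas method `value_counts`. Both give the same tallies, but the built-in version is a single line that a pandas user will recognise at once.

```python
import pandas as pd

# Toy data, made up for illustration
shapes = pd.Series(["DISK", "OVAL", "DISK", "SPHERE", "DISK", "OVAL"])

# Hand-written tally
manual = {}
for shape in shapes:
    manual[shape] = manual.get(shape, 0) + 1

# Built-in equivalent, already sorted from most to least frequent
built_in = shapes.value_counts()

print(manual)
print(built_in)
```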

<br>


![](http://images-cdn.moviepilot.com/images/c_fill,h_630,w_1200/t_mp_quality/qlcekpbawgphnvalmzer/game-of-thrones-and-melisandre-let-s-talk-about-necklaces-949671.jpg)

One more thing: **"beautiful is better than ugly"**. Use headings and sub-headings in the Markdown cells, plus an introduction, to guide your reader. Do not put all your code in one cell. A cell should be a self-contained functional unit.

Some additional information: it is good practice to structure your notebook, for example:

* Introduction
* Notes on Data source, output data, versions of your program etc. 
* Data preparation steps
* Analysis (possibly with visualisation of the results)
* Discussion 
* Conclusions





# Homework

Please go through these videos!

https://www.youtube.com/channel/UCnVzApLJE2ljPZSeQylSEyg