<a href="https://colab.research.google.com/github/esohman/EADH/blob/main/3_EADH_intermediate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#The Intermediate notebook
Welcome to the intermediate notebook.
By now you should have a solid grasp of how different data types work and how they can be used and manipulated in different situations.

You should also understand for loops and list comprehensions, be able to do small-scale NLP projects using NLTK and/or spaCy, and understand how to create and use regex patterns.

If you feel that this is not the case, I recommend that you go back and revise the sections that you feel you do not yet quite understand. Look at the notebooks themselves and follow the links to additional resources to better understand all the different aspects of Python programming at the beginner level.

Welcome to the intermediate level!




#Functions

The first thing we are going to discuss is functions. Functions are a way to reuse code. Instead of writing the same or similar piece of code multiple times, you can create a function and call it every time you need it.

Say you have a list of tuples that contain the heights and bases of triangles. You can create a for loop to go through that list and in the loop pass those values to your function that calculates the area of the triangle.

Check out the excellent [video](https://pythonhumanities.com/lesson-10-python-functions/) on functions on Python for DH.


---


We create functions with the keyword **def**, followed by the name of the function and then parentheses. The parentheses can be left empty, but they typically contain the names of the variables that are passed to the function when it is called and that you want to use inside it. In the code below, you pass the arguments i,j to the function tri_area. Inside that function, i,j are known as h,b. As a side note, i and j are called arguments when they are passed to the function, while h and b are called parameters. Some people use the two terms interchangeably, but strictly speaking there is a distinction.

Functions usually end with a return statement. The return statement states what the function outputs.

In [None]:
def tri_area(h,b):
  return (h*b)/2

lst = [(2.45,4),(3.3,5.4),(4.2,2.7),(4,4),(68,45)]
for i,j in lst:
  print(tri_area(i,j))



A more complex example where we call another function from within a function:

In [None]:
from math import pi

def area_circle(r):
  return pi * r ** 2

def vol_cylinder(r,h):
  return area_circle(r)*h

#we can use the same list of values from the previous example
lst = [(2.45,4),(3.3,5.4),(4.2,2.7),(4,4),(68,45)]

for i,j in lst:
  print(f'The area of the circle is {"{:.2f}".format(area_circle(i))} and the volume of the cylinder is {"{:.2f}".format(vol_cylinder(i,j))}')


###format()
You might have noticed that we used .format() when printing. You do not have to use format here, but you can use it to specify the number of decimal places you want to show.

Learn more about format() [here](https://www.w3schools.com/python/ref_string_format.asp).
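
As a small sketch of what the format specification does (the value below is made up for illustration), `:.2f` turns a float into a string with two decimal places. The same specification can also be written directly inside an f-string:

```python
value = 3.14159  # made-up example value

with_format = "{:.2f}".format(value)  # format spec passed to str.format()
with_fstring = f"{value:.2f}"         # same spec written inside an f-string

print(with_format)
print(with_fstring)
print(with_format == with_fstring)  # both approaches produce the same string
```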

###input
Sometimes you want to get user input. Now, there are ways of creating websites and graphical user interfaces in Python, but what if we just want a value or two for interactiveness in our script?

In [None]:
def monthly_pay(hrs,hpay,extra=0): #we can set a default value to a parameter
  return (hrs*hpay)*0.8+extra #let's assume a tax rate of 20%, this could be a parameter too
  
  
hours = float(input("How many hours did you work this month: "))
hourly = float(input("What is your hourly pay: "))


e = input("Are you getting a bonus or similar?(y/n) ")

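if e == "y":
  extra = float(input("How much: "))
elif e == "n":
  extra = 0
else:
  print("Unrecognized answer, assuming no bonus.")
  extra = 0 #without this line, extra would be undefined below and raise a NameError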

print(f'Your monthly take-home pay is {monthly_pay(hours,hourly,extra)}')
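
To see the default value at work without typing any input, here is a minimal sketch that calls the same function with and without the third argument. The hours and pay are made-up numbers.

```python
def monthly_pay(hrs, hpay, extra=0):
    # same calculation as above: 20% tax on the hourly earnings, bonus added on top
    return (hrs * hpay) * 0.8 + extra

no_bonus = monthly_pay(160, 20)               # extra falls back to its default, 0
with_bonus = monthly_pay(160, 20, extra=100)  # the default is overridden by keyword

print(no_bonus)
print(with_bonus)
```

Passing `extra` by name is optional here, but it makes the call easier to read.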


##Modules, libraries, and documentation
We have increasingly started using external modules and libraries. These libraries provide functions that are not built into your base Python installation. Sometimes you need to use pip install to install them. On Colab you can use pip with "!pip install pandas", where pandas is the name of the library you want to install. Many of the most commonly used libraries are already installed on Colab, so you rarely have to do this.

We import these libraries so that we can use them in our code. We do this by typing import and the name of the library we want. We can also import only certain parts of a library. In the functions example we could have written:

```
import math

r = 2 #example radius
print(math.pi*r**2)
```
or
```
from math import pi
r = 2 #example radius
print(pi*r**2)
```
We can also rename the libraries we are importing:
```
import pandas as pd
```
All decent libraries come with documentation. Documentation is very important in learning to understand how to use new libraries, or how to get the most out of familiar libraries.

If you are ever stumped on how to do something, the documentation of the library you are using should be one of the first places you look for more information. Stack Overflow is another good place to look.
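
Some documentation is also available directly from Python. As a minimal sketch, the built-in `help()` prints the documentation that ships with a function, here using `math.sqrt` as an arbitrary example:

```python
import math

# help() prints the documentation that ships with a function, module, or class
help(math.sqrt)

# the same text is stored in the object's __doc__ attribute (its docstring)
doc = math.sqrt.__doc__
print(doc)
```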


#Pandas & numpy

pandas is a highly useful Python library for data analysis.
Any time you are dealing with csv files, you should consider whether pandas might be the best option for the task at hand.

pandas is built on numpy, so numpy's mathematical functions can be applied directly to pandas data.

With pandas we can create dataframes, which are kind of like spreadsheets in that we have rows and columns of data. With pandas it is very easy to manipulate this data.


## Series
A Series is like a one-column dataframe. It cannot have a column name, but it can have a series name, and you can name the rows.


In [None]:
import pandas as pd

my_list = [123,2134,123] # a list of integers

#let's make this list into a pandas series
my_series = pd.Series(my_list)

We can test the difference between a list and that same list having been converted to a series.

In [None]:
print(f'My list: {my_list}\nMy Series: \n{my_series}')

Just like lists, we can use square brackets to access elements in the series. In this case, you can think of the index as the row number. The default numbering starts from 0.

In [None]:
#accessing individual elements
print(my_series[1])

We can set the index names when we create the series, or we can rename the index after creation:

In [None]:
#renaming the index
my_series1 = pd.Series(my_list,index = ["first", "second", "third"])
my_series2 = my_series.rename(index = {0:"first",1:"second",2:"third"}) #there is also the "inplace" option. Remember what it does?

In [None]:
print(f'1:\n{my_series1}\n2: \n{my_series2}')

Series can also be created from dictionaries in addition to lists. In this case the dictionary key becomes the index name and the value becomes the value of that index.

In [None]:
#Creating a series from a dictionary
dicty = {"first":"lalala","second":123,"third":99.4}
my_dictseries = pd.Series(dicty) #if you specify the index here, the series will only consist of the specified indexes e.g.: my_dictseries = pd.Series(dicty, index = ["first","third"])

In [None]:
print(f'Series from dict:\n{my_dictseries}')
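
The series name mentioned at the start of this section can be set with the `name` argument. The sketch below uses made-up values, and `word_count` is just an illustrative name:

```python
import pandas as pd

# made-up values; name= labels the whole series
word_counts = pd.Series([120, 85, 43], index=["first", "second", "third"], name="word_count")

print(word_counts.name)       # the series name
print(word_counts["second"])  # access a value by its row name

# when a named series is turned into a dataframe, the series name becomes the column name
as_frame = word_counts.to_frame()
print(list(as_frame.columns))
```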

##Dataframes
Dataframes are the essence of pandas, and most data analysis with Python relies on pandas dataframes. If series are like columns in a spreadsheet, dataframes are like individual sheets or tables. We can combine series to make dataframes, we can read in csv or Excel files to create dataframes, or we can create them from dictionaries, for example. We can also create dataframes from a variety of other sources such as json files.

Kaggle has a [great tutorial on pandas](https://www.kaggle.com/learn/pandas) if you want to dive in deeper into basic data analysis. 

Kaggle also offers additional pandas tutorials of varying quality on many different topics. [This link](https://www.kaggle.com/search?q=pandas) will take you to up-to-date search results of pandas tutorials on Kaggle.

In [None]:
#we can merge series to create a dataframe
df1 = pd.concat([my_series1, my_series2,my_dictseries], axis=1)

#or we can create one from a dictionary (typically a dictionary of lists)
d = {"ex1":[89,8,6,1,2,7,6],"ex2":[7,5,1,66,8,74,1]}
df2 = pd.DataFrame(d)

In [None]:
print(f'Df1:\n{df1}\nDf2:\n{df2}')

###loc and iloc
We can use loc and iloc to access specific information in a dataframe.

In [None]:
#access a specific row using iloc
df2.iloc[3] # we are accessing index 3, which gives us the values for all columns

In [None]:
df2.iloc[:,1] # we can also use iloc to access a single column
#the first part is to say that we want all the indexes, the second number is the column index (which also starts from 0)

iloc slices work just like list slices.

Although iloc and loc look virtually identical, there is one important difference: iloc is position-based (it uses the integer position of a row or column) and loc is name-based (it uses the index and column names).

Compare:

In [None]:
df2.iloc[3] #accessing index 3

In [None]:
df2.loc[3] #accessing the index named 3

In [None]:
df1.iloc[1] #accessing the second index, i.e. index 1 for columns 0,1, and 2

In [None]:
df1.loc[1] #gets you a KeyError, because df1 has no index named 1

In [None]:
df1.loc["second"]
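
One more difference shows up when slicing. As a minimal sketch with made-up data (`df_demo` is a hypothetical dataframe, not one used elsewhere in this notebook), an iloc slice leaves out the end position, while a loc slice includes the end name:

```python
import pandas as pd

# made-up data with the default integer index 0..4
df_demo = pd.DataFrame({"ex1": [10, 20, 30, 40, 50]})

by_position = df_demo.iloc[0:2]  # positions 0 and 1; the end is excluded, as with lists
by_label = df_demo.loc[0:2]      # names 0, 1, and 2; the end name is included

print(len(by_position))
print(len(by_label))
```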

We can also access values based on conditions:

In [None]:
print(df2.ex2 > 10) #this gets us a series of booleans
print(df2.loc[df2.ex2 > 10]) #this shows only the rows where the condition is True
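
Conditions can also be combined. The sketch below uses a small made-up dataframe with the same column names as df2. Note that pandas uses `&` and `|` here rather than `and` and `or`, and that each condition needs its own parentheses:

```python
import pandas as pd

# made-up data using the same column names as df2
df_demo = pd.DataFrame({"ex1": [89, 8, 6, 1], "ex2": [7, 5, 1, 66]})

# & means "and", | means "or"; each condition is wrapped in parentheses
both = df_demo.loc[(df_demo.ex1 > 5) & (df_demo.ex2 > 5)]
either = df_demo.loc[(df_demo.ex1 > 50) | (df_demo.ex2 > 50)]

print(both)
print(either)
```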

### Reading in a csv file
The most common file type you will likely be reading in is a csv file. We can simply use read_csv to create a dataframe from the data:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv') 
#you can replace the path with any csv file, such as a dataset from Kaggle

In [None]:
#Head gets us the first few rows of data in our dataframe. The default number of rows is 5, but we can specify a number too.
df.head(12)

In [None]:
#Tail works the same as head but gets us the last rows
df.tail(7)

In [None]:
#we can also use info and shape to get an overview of the data
df.info()

In [None]:
df.shape

In [None]:
#you can access a specific column by putting its name in square brackets
print(df["Country"])

In [None]:
print(df.Country) #or, for a single column, use dot notation

In [None]:
#how about a specific cell
df.Country[121]

In [None]:
#or
df["Country"][121]
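
A single cell can also be reached in one step with loc or iloc by giving both the row and the column. This is a minimal sketch on made-up data that stands in for the countries dataframe:

```python
import pandas as pd

# made-up data standing in for the countries dataframe
df_demo = pd.DataFrame({"Country": ["Finland", "Kenya", "Peru"],
                        "Region": ["EUROPE", "AFRICA", "SOUTH AMERICA"]})

chained = df_demo["Country"][1]       # column first, then row
with_loc = df_demo.loc[1, "Country"]  # row name and column name in one step
with_iloc = df_demo.iloc[1, 0]        # row position and column position

print(chained, with_loc, with_iloc)
```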

## numpy

When we use numpy with pandas, we have access to powerful math tools. numpy arrays are also much more efficient than Python lists for numerical work, because operations can be applied to the whole array at once. If you have a lot of numbers that you would otherwise loop through, it might be worth converting your lists to numpy arrays first.

We are not exploring numpy in depth in this notebook, but let's look at the basics.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

In [None]:
mylist = [5,3,4,7,8]

arr2 = np.array(mylist) #this doesn't have to be a list. It can also be a tuple or other sequence
arr2

In [None]:
arr

These numpy arrays are 1-dimensional. Numpy arrays can have multiple dimensions though.

In [None]:
np0 = np.array(3) #0D
np1 = np.array([1,2,3,4]) #1D
np2 = np.array([[1,2,3,4],[5,6,7,8]]) #2D etc

In [None]:
#access elements
np1[2]

In [None]:
np2[1,3]

The beauty of numpy is not only in its arrays, but in the copious mathematical functions it offers. You will use them with data analysis.
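
As a small taste of those functions, here is a sketch with made-up values. Arithmetic and numpy functions are applied to every element of the array at once, so no for loop is needed:

```python
import numpy as np

arr_demo = np.array([1, 4, 9, 16])  # made-up values

doubled = arr_demo * 2     # the multiplication is applied to every element
roots = np.sqrt(arr_demo)  # numpy functions also work element by element

print(doubled)
print(roots)
print(arr_demo.mean(), arr_demo.sum())
```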

#Data Analysis
For more basic pandas I recommend the Kaggle notebook linked to above, but if you need something even more basic, [W3Schools also offers a pandas tutorial](https://www.w3schools.com/python/pandas/default.asp) that deals with the very basics.


I whole-heartedly recommend [this notebook](https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb#scrollTo=0oVZnezwQ159) for getting started with Exploratory Data Analysis. It covers all the basic functions you will need to examine a data set (like a csv file).

Kaggle also has a good exploratory [data analysis tutorial](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas) with links to further resources.

Another great resource is the Kaggle pandas tutorial linked to above, and also [this pandas cheatsheet with exercises](https://www.kaggle.com/rajacsp/pandas-cheatsheet-125-exercises) from Kaggle.

In [None]:
#Let's check how many null values we have (empty cells or cells with a value of NaN)
df.isnull().sum()

In [None]:
#We can also check unique values
df.Region.unique()

In [None]:
#or just the number of unique values
df.Region.nunique()

In [None]:
#Another useful function is groupby
df.groupby("Region").Country.count()
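
To make it easier to see what groupby does, here is a minimal sketch on a tiny made-up dataframe with the same column names. The rows are split into groups by Region, and the Country values are then counted within each group:

```python
import pandas as pd

# made-up data using the same column names as the countries dataset
df_demo = pd.DataFrame({
    "Country": ["Finland", "Sweden", "Kenya", "Peru", "Chile"],
    "Region": ["EUROPE", "EUROPE", "AFRICA", "SOUTH AMERICA", "SOUTH AMERICA"],
})

# split the rows by Region, then count the Country values in each group
per_region = df_demo.groupby("Region").Country.count()
print(per_region)
```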

###numpy with pandas
Let's load a different dataset with numbers so that we can check out some math functions.

This is a dataset that lists Eurovision Song Contest votes.

In [None]:
df = pd.read_csv('https://github.com/Spijkervet/eurovision-dataset/releases/download/2020.0/votes.csv') 
df.head()

In [None]:
df1980 = df.loc[df.year == 1980] #let's create a subset of the dataset for just the year 1980
df1980

In [None]:
df1980.total_points.mean() #we can calculate the mean of an entire column

In [None]:
df1980.total_points.median() #or the median
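
numpy functions can also be called directly on a pandas column. The sketch below uses made-up points rather than the Eurovision data, and `df_demo` is a hypothetical dataframe:

```python
import numpy as np
import pandas as pd

# made-up points, not taken from the Eurovision dataset
df_demo = pd.DataFrame({"total_points": [12, 10, 8, 0, 5]})

col_mean = np.mean(df_demo.total_points)      # same result as df_demo.total_points.mean()
col_median = np.median(df_demo.total_points)  # same result as df_demo.total_points.median()
roots = np.sqrt(df_demo.total_points)         # element by element; the result is still a Series

print(col_mean, col_median)
print(roots)
```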

#Visualization

Kaggle comes through again with a [great tutorial on visualizations](https://www.kaggle.com/learn/data-visualization).

We will be working on many similar things in the next section.

In [None]:
import seaborn as sns                       #visualization
import matplotlib.pyplot as plt             #visualization
%matplotlib inline     
sns.set(color_codes=True)

###correlation matrix
In this matrix we are looking at how well the numerical column values correlate with each other throughout all the years.

In [None]:
plt.figure(figsize=(10,5))
cm = df.corr(numeric_only=True) #numeric_only=True is needed in pandas 2.0 and later
sns.heatmap(cm,cmap="OrRd",annot=True)
cm #why are we only getting a correlation matrix for 3 columns?
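
As a guide to reading the numbers in the matrix, here is a minimal sketch with made-up columns. A value close to 1 means two columns rise together, and a value close to -1 means one falls as the other rises:

```python
import pandas as pd

# made-up data: b rises with a, c falls as a rises
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [8, 6, 4, 2],
})

cm_demo = df_demo.corr()
print(cm_demo)
```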

In [None]:
plt.figure(figsize=(10,8),dpi =150)
sns.scatterplot(data = df , x = 'year',y= 'total_points',hue = 'to_country')

## bad visualization
This is a great example of a terrible visualization.
1. It is technically a bad visualization because it is hard to see anything, and the legend is very long and covers part of the plot.
2. It is also practically bad because we do not gain much insight from it.
3. It is also ugly because there are so many countries that the colors are almost impossible to tell apart in the graph.

Perhaps we can fix it.



In [None]:
temp = df.groupby("year").total_points.mean()
temp = pd.DataFrame(temp)
temp

In [None]:
plt.figure(figsize=(10,8),dpi =150)
sns.scatterplot(data = temp , x = 'year',y='total_points')

That's better! We can see that it is quite likely there have been four different voting systems in use.

For more visualization with seaborn and matplotlib + pandas, take a look at [this tutorial](https://www.kaggle.com/learn/data-visualization) on Kaggle.

# ALL DONE!
You have now completed the intermediate notebook and can move on to the advanced one!