# Pandas tutorial

## The basics

In this tutorial you will learn the basics of using Pandas in Python.
Pandas is a Python library that makes it easier to manipulate and analyse data. You can read more about Pandas on its [website](https://pandas.pydata.org/).

## Contents of the tutorial

1. Import libraries
    * math, pandas, matplotlib & seaborn, sklearn, sqlite3, numpy, statsmodels
2. DataFrames
    * Concept
    * Commands
    * See the data
3. Series
    * Concept
    * Commands
    * See the data
4. Import data files
    * CSV
    * Matlab
5. Statistics
6. Sorting and grouping a DataFrame
7. Adding a column by calculation
8. Delete data for optimization
9. Working with NaN values
10. Plotting a DataFrame

______________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 1. Import libraries

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Libraries need to be imported so that you can work with them. There are several useful libraries:

- pandas: for using pandas. 
- math: can handle mathematical functions easily. 
- matplotlib: for producing quality 2D graphics.
- seaborn: for statistical graphics, on top of matplotlib.
- numpy: for linear algebra, large arrays, etc.
- SciPy: for mathematics, science and engineering, on top of NumPy.
- sklearn: for machine learning, on top of SciPy.
- sqlite3: can handle databases with the SQL query language.
- statsmodels: for statistical computations. 

Importing such libraries can be done by running the following line of code:

In [3]:
import pandas as pd

Next, import the other libraries. 

In [2]:
## TRY OUT: import the other libraries here
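
As a rough sketch of what the remaining imports could look like, the cell below uses the aliases that are common in the community (`np`, `plt`, `sns`, `sm`). These aliases are a convention and an assumption here, not something this tutorial prescribes. The third-party packages are assumed to be installed, so each import is wrapped in a check that reports a missing package instead of stopping the cell.

```python
import importlib

# Standard-library modules: always available.
import math
import sqlite3

# Third-party packages and their conventional aliases (assumed to be installed).
third_party = {
    "np": "numpy",
    "plt": "matplotlib.pyplot",
    "sns": "seaborn",
    "sklearn": "sklearn",
    "sm": "statsmodels.api",
}

missing = []
for alias, module_name in third_party.items():
    try:
        # Equivalent to e.g. `import numpy as np`.
        globals()[alias] = importlib.import_module(module_name)
    except ImportError:
        missing.append(module_name)

print("Missing packages:", missing or "none")
```

When every package is installed, the plain form (`import numpy as np`, `import seaborn as sns`, and so on) does the same thing and is easier to read.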


____________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 2. DataFrame

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

A DataFrame is a table, just like the ones you probably know from Excel. It contains rows and columns. For example, this is a small DataFrame:


In [3]:
pd.DataFrame({'Yes':[50,21], 'No':[131,20]})

Unnamed: 0,Yes,No
0,50,131
1,21,20


You can see that the rows are labelled 0 and 1. These index labels can be changed to other numbers or even to strings if you prefer. 
The code you need for this is located in the cell below. Test it and see what happens.

In [4]:
pd.DataFrame({'Yes':[50,21], 'No':[131,20]},index = ['Product A','Product B'])

Unnamed: 0,Yes,No
Product A,50,131
Product B,21,20
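
A few attributes and indexers are enough to inspect a DataFrame once it exists. The sketch below reuses the product table from this section; the variable name `df` is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 20]},
                  index=['Product A', 'Product B'])

print(df.shape)             # (number of rows, number of columns)
print(list(df.columns))     # column labels
print(df['Yes'])            # one column, returned as a Series
print(df.loc['Product B'])  # one row, selected by its index label
```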


_______________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 3. Series

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Pandas also has something called a Series. A Series is a sequence of data values: if a DataFrame is a table, a Series is a list. To create one you just need some data values:

In [5]:
pd.Series([5,6,7,8])

0    5
1    6
2    7
3    8
dtype: int64
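
Like a DataFrame, a Series can be given its own index labels, and it can also carry a name. The following sketch shows both; the labels and the name are made up for illustration.

```python
import pandas as pd

# Hypothetical sales figures; the labels and the name are invented.
sales = pd.Series([5, 6, 7, 8],
                  index=['Mon', 'Tue', 'Wed', 'Thu'],
                  name='Sales')

print(sales['Tue'])   # look up a value by its label
print(sales.mean())   # a Series supports the same statistics as a DataFrame column
```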

___________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 4. Import data files

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Creating a DataFrame or Series by hand is nice, but the power of Pandas comes from reading data that is already available. For a CSV file this can simply be done by using the `read_csv()` function from Pandas. 

In biomedical engineering, the dataset we want to read is sometimes stored in a MATLAB file with the .mat extension.

In [4]:
movies = pd.read_csv("Movies.csv")
movies  # movies.head() would show only the first 5 rows of the DataFrame

Unnamed: 0,title,year,lifetime_gross,ratingInteger,ratingCount,duration,nrOfWins,nrOfNominations,nrOfPhotos,nrOfNewsArticles,...,RealityTV,Romance,SciFi,Short,Sport,TalkShow,Thriller,War,Western,Unnamed: 37
0,METROPOLIS,1928,1236166,8,81007,9180,3,4,67,428,...,0,0,1,0,0,0,0,0,0,
1,CITY LIGHTS,1931,19181,9,70057,5220,2,0,38,187,...,0,1,0,0,0,0,0,0,0,
2,MODERN TIMES,1936,163577,9,90847,5220,3,1,44,27,...,0,0,0,0,0,0,0,0,0,
3,GONE WITH THE WIND,1939,198676459,8,160414,14280,10,6,143,1263,...,0,1,0,0,0,0,0,1,0,
4,THE WIZARD OF OZ,1939,22342633,8,209506,6120,6,12,126,2363,...,0,0,0,0,0,0,0,0,0,
5,CITIZEN KANE,1941,1585634,9,228617,7140,7,10,103,1439,...,0,0,0,0,0,0,0,0,0,
6,CASABLANCA,1942,4108411,9,296802,6120,5,6,153,382,...,0,1,0,0,0,0,0,1,0,
7,THE BEST YEARS OF OUR LIVES,1946,23650000,8,30002,10320,13,1,15,129,...,0,1,0,0,0,0,0,1,0,
8,THE THIRD MAN,1949,618173,8,85964,6240,3,4,62,338,...,0,0,0,0,0,0,1,0,0,
9,ALL ABOUT EVE,1950,63463,8,61738,8280,15,15,42,253,...,0,0,0,0,0,0,0,0,0,
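
Since `Movies.csv` itself is not reproduced here, the sketch below reads a tiny stand-in from a string instead of a file. The three rows are copied from the output above, and the name `sample` is only an illustration. It shows that `read_csv()` infers column types and that `head()` limits the number of rows shown.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file; the values are copied from the first rows above.
csv_text = """title,year,ratingCount
METROPOLIS,1928,81007
CITY LIGHTS,1931,70057
MODERN TIMES,1936,90847
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first 2 rows; head() without an argument shows 5
print(sample.dtypes)    # pandas infers a type for every column
```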


The highest value of ratingCount can be found by using:

In [5]:
mostrating = movies['ratingCount'].max()
mostrating

1183395
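
Note that `max()` returns the value itself, not the row it belongs to. One way to find the matching movie is `idxmax()`, which returns the index label of the largest value. The sketch uses a three-row stand-in for `movies` (the name `sample` is an assumption for illustration).

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['METROPOLIS', 'CITY LIGHTS', 'MODERN TIMES'],
    'ratingCount': [81007, 70057, 90847],
})

best_index = sample['ratingCount'].idxmax()   # index label of the largest value
best_title = sample.loc[best_index, 'title']  # title stored in that row
print(best_title)
```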

To import a MATLAB file, the following function can be used: ............
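
The tutorial does not say which function is meant here. One commonly used option, given as an assumption rather than as the author's choice, is `scipy.io.loadmat`, which reads a .mat file into a dictionary of NumPy arrays. The sketch first writes a small example file with `scipy.io.savemat`, so that it does not depend on an existing MATLAB file; the variable names `time` and `signal` are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy.io import loadmat, savemat

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.mat')

    # Write a small .mat file first, so the sketch does not depend on MATLAB.
    savemat(path, {'time': np.array([0.0, 0.5, 1.0]),
                   'signal': np.array([1.2, 3.4, 5.6])})

    # loadmat returns a dict that maps MATLAB variable names to NumPy arrays.
    contents = loadmat(path, squeeze_me=True)

# Keys starting with '__' hold file metadata, so only the real variables are kept.
variables = {name: values for name, values in contents.items()
             if not name.startswith('__')}
mat_df = pd.DataFrame(variables)
print(mat_df)
```

This only works directly when all variables have the same length. Files saved in MATLAB's v7.3 format are not supported by `loadmat` and need an HDF5 reader instead.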

_________________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 5. Statistics

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Statistics are also very important when working with datasets. In this section various functions will be discussed and used to find out some more about the dataset. 

On your cheat sheet you can find several functions in the section statistics. For example:
`df.count()`: returns the number of non-missing values in each column.

In the next cell, try it out for the number of nominations.

In [10]:
## TRY OUT: find the number of nominations
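
To show what `count()` does with missing data, here is a minimal sketch on a made-up table (the name `sample` and its values are assumptions). It uses a different column than the exercise, so the exercise is left open.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'nrOfWins': [3, np.nan, 10, 6],
})

print(sample.count())              # non-missing values per column
print(sample['nrOfWins'].count())  # non-missing values in one column
print(len(sample))                 # number of rows, missing or not
```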


The function `df.describe()` can also be very useful to get an overview of your data. Try it out below. The argument `include='all'` can be used to get the summary statistics of all columns, including non-numeric ones. 

In [14]:
## TRY OUT: get an overview of your data.
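
The difference between the default call and `include='all'` is easiest to see on a table with one text column and one numeric column. The sketch below uses made-up data (`sample` is a hypothetical name).

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'duration': [9180, 5220, 5220, 14280],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include='all')  # text columns as well

print(numeric_summary)
print(full_summary)
```

For text columns, the extra rows `unique`, `top` and `freq` appear; the numeric statistics are left empty for those columns.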


____________________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 6. Sorting and grouping a DataFrame

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

When using data, it can be very useful to sort the data according to a certain value. Sorting the DataFrame can be done by: `df.sort_values(by=column)`. Additionally, the argument `ascending=False` can be added to sort the values in descending order. The index of the DataFrame can also be reset afterwards by calling the method `.reset_index()`.

In the next cell, try it out for the number of news articles.

In [10]:
## TRY OUT: sort the dataframe by number of news articles
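
Sorting and resetting the index can be sketched on a small made-up table as follows (the name `sample` is an assumption, and a different column than the exercise is used).

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['METROPOLIS', 'CITY LIGHTS', 'MODERN TIMES'],
    'year': [1928, 1931, 1936],
})

# Newest movie first; the original index labels travel with their rows.
newest_first = sample.sort_values(by='year', ascending=False)

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... and discards the old labels.
renumbered = newest_first.reset_index(drop=True)

print(newest_first)
print(renumbered)
```

Without `drop=True`, the old index is kept as an extra column named `index`.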


The rows in a DataFrame can also be grouped. This can be done by using `df.groupby()`. Try it out below for genre.

In [11]:
## TRY OUT: group the data by genre.
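
Calling `groupby()` on its own only defines the groups; a result appears once an aggregation is applied. A minimal sketch, grouping a made-up table by year instead of genre (`sample` is a hypothetical name):

```python
import pandas as pd

sample = pd.DataFrame({
    'year': [1939, 1939, 1941, 1942],
    'nrOfWins': [10, 6, 7, 5],
})

# An aggregation such as mean() or size() turns the groups into a result.
wins_per_year = sample.groupby('year')['nrOfWins'].mean()
movies_per_year = sample.groupby('year').size()

print(wins_per_year)
print(movies_per_year)
```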


___________________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 7. Adding a column by calculation

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

New values can also be calculated from the DataFrame. This can be done by naming the new column, followed by the calculation. An example can be seen below.

In [12]:
movies['NoWinNominations'] = movies['nrOfWins']-movies['nrOfNominations']
movies

Unnamed: 0,title,year,lifetime_gross,ratingInteger,ratingCount,duration,nrOfWins,nrOfNominations,nrOfPhotos,nrOfNewsArticles,...,Romance,SciFi,Short,Sport,TalkShow,Thriller,War,Western,Unnamed: 37,NoWinNominations
0,METROPOLIS,1928,1236166,8,81007,9180,3,4,67,428,...,0,1,0,0,0,0,0,0,,-1
1,CITY LIGHTS,1931,19181,9,70057,5220,2,0,38,187,...,1,0,0,0,0,0,0,0,,2
2,MODERN TIMES,1936,163577,9,90847,5220,3,1,44,27,...,0,0,0,0,0,0,0,0,,2
3,GONE WITH THE WIND,1939,198676459,8,160414,14280,10,6,143,1263,...,1,0,0,0,0,0,1,0,,4
4,THE WIZARD OF OZ,1939,22342633,8,209506,6120,6,12,126,2363,...,0,0,0,0,0,0,0,0,,-6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3745,AN INCONVENIENT TRUTH,2006,24146161,8,56014,6000,27,6,62,756,...,0,0,0,0,0,0,0,0,,21
3746,ADRIFT IN MANHATTAN,2007,2099,5,1328,5460,3,1,7,7,...,0,0,0,0,0,0,0,0,,2
3747,HOSTEL: PART II,2007,17609452,5,57934,5640,0,0,46,232,...,0,0,0,0,0,0,0,0,,0
3748,KALYUG,2005,1435,7,1250,7560,0,0,0,0,...,0,0,0,0,0,0,0,0,,0


________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 8. Delete data for optimization 

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

When a DataFrame is very big, it often contains a lot of values that are unnecessary for your research. These values can be deleted with the function `df.drop()`. It works for one row, several rows, one column or several columns. 

In the cell below, try it out for the number of photos by creating a new DataFrame.

In [13]:
## TRY OUT: drop the column of number of photos by creating a new DataFrame.
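
The row and column variants of `drop()` can be sketched like this, on a made-up table and with a different column than the exercise (`sample` is a hypothetical name):

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [1928, 1931, 1936],
    'duration': [9180, 5220, 5220],
})

# drop() returns a new DataFrame; `sample` itself is left unchanged.
without_column = sample.drop(columns=['duration'])   # drop one or more columns
without_rows = sample.drop(index=[0, 2])             # drop rows by index label

print(without_column)
print(without_rows)
```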


_________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 9. Working with NaN values

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

There are often values which are 'NaN'. This means that the value is missing. There are several options in this case: you can give these values a replacement, for example 0 or 1, or you can drop them.

To drop these values, use `df.dropna()`. When the argument `how='any'` is used, a row is dropped when any of its values is missing. When the argument `how='all'` is used, a row is dropped only when all of its values are missing. Try it out below.

In [14]:
## TRY OUT: make a new DataFrame where there are no NaN values.
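
The difference between `how='any'` and `how='all'` shows up clearly on a three-row table in which one row is complete, one is partly missing and one is entirely missing. The data and the name `sample` are made up.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
})

print(sample.isna().sum())               # number of missing values per column

any_dropped = sample.dropna(how='any')   # keeps only row 0
all_dropped = sample.dropna(how='all')   # keeps rows 0 and 1

print(any_dropped)
print(all_dropped)
```

In the `movies` output shown earlier, the column `Unnamed: 37` is empty in every displayed row. If that holds for the whole file, `how='any'` would remove all rows; `dropna(axis=1, how='all')` drops such fully empty columns first.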


To replace the missing values, the function `df.fillna()` can be used. Try this out for the title below by creating a new DataFrame. 

In [15]:
## TRY OUT: make a new DataFrame with NaN values changed for title.
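
`fillna()` accepts a single value for the whole table or a dictionary with one value per column. A minimal sketch of the dictionary form, on made-up data (`sample` and the replacement values are assumptions):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'title': ['A', None, 'C'],
    'nrOfWins': [3.0, np.nan, 10.0],
})

# A dict gives every column its own replacement value.
filled = sample.fillna({'title': 'UNKNOWN', 'nrOfWins': 0})
print(filled)
```

As with `dropna()`, the result is a new DataFrame and `sample` keeps its missing values.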


________________________

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

## 10. Plotting a DataFrame

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Plotting can give useful insight into your DataFrame. Here we will teach you the basics of plotting. 
First of all, matplotlib and seaborn need to be imported. This has already been done for you in the cell below. Also, the line `%matplotlib inline` is added to make plots appear in your notebook.

In [16]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The general function for plotting is: `df.plot(kind='...',...)`.
Different kinds can be used, such as:
* line: also the default setting.
* area: area is filled under the line.
* bar: vertical bar chart.
* barh: horizontal bar chart.
* hist: histogram. With argument `bins=n`, the number of bins can be adjusted.
* box: boxplot. With argument `by=...`, it will be grouped by the given columns.
* density: density plot.
* scatter: scatter plot. Requires the arguments `x=...` and `y=...`.
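
Two of the kinds listed above can be tried on a small table first. The sketch below is only an illustration with made-up data (`sample`, the titles and the figure size are assumptions); it draws a histogram and a scatter plot side by side.

```python
import matplotlib.pyplot as plt
import pandas as pd

sample = pd.DataFrame({
    'year': [1928, 1931, 1936, 1939, 1939],
    'duration': [9180, 5220, 5220, 14280, 6120],
})

# One figure with two plotting areas next to each other.
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of one column; `bins` sets the number of bars.
sample['duration'].plot(kind='hist', bins=3, ax=left, title='Duration')

# Scatter plots need the names of the x and y columns.
sample.plot(kind='scatter', x='year', y='duration', ax=right,
            title='Duration per year')

fig.tight_layout()
```

Passing `ax=...` tells Pandas where to draw, which makes it possible to combine several plots in one figure.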

When you don't remember which plots are available with which arguments, you can always get help by putting `?` after your command. The command has to refer to an object that exists: `df` is only a placeholder name, so the help is requested here for the `movies` DataFrame.

Below you can run the help command. 

In [17]:
movies.plot?


Next, make some plots for yourself.

In [18]:
## TRY OUT: make several plots of the movies DataFrame.
