# Introduction to pandas
Based on tutorials by M. Shahid https://github.com/dshahid380/Data-analysis-with-pandas and exercises by G. Samora https://github.com/guipsamora/pandas_exercises

## What is pandas?

<img src="IMG/pandas_logo.png">

***pandas*** is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like **R**.[pandas website]


### External Information
* Pandas website:  https://pandas.pydata.org/
* Pandas user guide: http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
* Pandas API documentation: http://pandas.pydata.org/pandas-docs/stable/reference/index.html
* Very useful: Pandas Cheat Sheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

## Getting started
To use pandas, first import the pandas module in your program:

In [3]:
import pandas as pd  # pd is the conventional alias for pandas

### Reading CSV and Excel sheets:

##### d=pd.read_csv("path"):
 * pd.read_csv() reads a CSV (comma-separated values) file from your computer.
 * Pass the path of the CSV file to the function as a quoted string.
 * Store the returned DataFrame in a variable; here it is stored in the variable "d".
 * read_csv() turns the CSV file into a DataFrame, whose columns you can access by name, much like the keys of a dictionary.
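
As a minimal sketch of the points above, the following example reads CSV text from an in-memory buffer instead of a file on disk, so it runs without any data files. The CSV content and the name `sample` are made up for illustration; only the column names are borrowed from the weather data used later.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here instead of a file on disk.
csv_text = """Formatted Date,Summary,Temperature (C),Humidity
2006-04-01 00:00,Partly Cloudy,9.47,0.89
2006-04-01 01:00,Partly Cloudy,9.36,0.86
2006-04-01 02:00,Mostly Cloudy,9.38,0.89
"""

# read_csv accepts a path as well as a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # column types inferred by pandas
print(sample["Humidity"])   # a column is looked up by name, like a dictionary key
```

With a real file you would pass its path instead of the `StringIO` object.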

##### d=pd.read_excel("path") : 
 * Works the same way as read_csv(), but reads an Excel sheet or file.

### Importing Weather data

In [4]:
#get the data
!git clone https://github.com/keuperj/DATA.git

fatal: destination path 'DATA' already exists and is not an empty directory.


In [5]:
d = pd.read_csv('DATA/weather.csv')  # returns a DataFrame object
d  # display the DataFrame d in table format

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.
...,...,...,...,...,...,...,...,...,...,...,...,...
96448,2016-09-09 19:00:00.000 +0200,Partly Cloudy,rain,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36,Partly cloudy starting in the morning.
96449,2016-09-09 20:00:00.000 +0200,Partly Cloudy,rain,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16,Partly cloudy starting in the morning.
96450,2016-09-09 21:00:00.000 +0200,Partly Cloudy,rain,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66,Partly cloudy starting in the morning.
96451,2016-09-09 22:00:00.000 +0200,Partly Cloudy,rain,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95,Partly cloudy starting in the morning.
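
The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a CSV is empty, which typically happens when a DataFrame was saved together with its index. The sketch below reproduces this on a tiny made-up CSV and shows one possible way to handle it with the `index_col` parameter; whether that is appropriate for the weather file is an assumption.

```python
import io
import pandas as pd

# Hypothetical miniature of a CSV that was saved together with its index:
# the first header field is empty.
csv_text = """,Summary,Humidity
0,Partly Cloudy,0.89
1,Partly Cloudy,0.86
"""

# Default behaviour: the nameless first column becomes a regular column.
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# index_col=0 uses the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```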


### First interaction with the data 
#### How many rows are in my DataFrame?

In [6]:
print(len(d))
# there are 96453 rows

96453
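
Besides `len()`, the `shape` attribute gives the number of rows and columns at once. The small DataFrame below is a made-up stand-in for the weather data so that the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the weather data.
sample = pd.DataFrame({
    "Temperature (C)": [9.47, 9.36, 9.38],
    "Humidity": [0.89, 0.86, 0.89],
})

n_rows, n_cols = sample.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(sample) == n_rows)    # len() counts rows only
```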


#### Getting the first n rows of the dataset
  * To see the first five rows, call the head() method on the DataFrame object, <br>
      for example:  
      <code> d.head() </code> <br>
  * To view the first n rows, pass n as an argument: <br>
      <code> d.head(n) </code>

In [7]:
d.head() # first five rows

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


In [8]:
d.head(9) # to print first n=9 rows

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.
5,2006-04-01 05:00:00.000 +0200,Partly Cloudy,rain,9.222222,7.111111,0.85,13.9587,258.0,14.9569,0.0,1016.66,Partly cloudy throughout the day.
6,2006-04-01 06:00:00.000 +0200,Partly Cloudy,rain,7.733333,5.522222,0.95,12.3648,259.0,9.982,0.0,1016.72,Partly cloudy throughout the day.
7,2006-04-01 07:00:00.000 +0200,Partly Cloudy,rain,8.772222,6.527778,0.89,14.1519,260.0,9.982,0.0,1016.84,Partly cloudy throughout the day.
8,2006-04-01 08:00:00.000 +0200,Partly Cloudy,rain,10.822222,10.822222,0.82,11.3183,259.0,9.982,0.0,1017.37,Partly cloudy throughout the day.


#### Getting the last n rows of the dataset
  * Call the tail() method: <br>
     <code> d.tail() </code> <br>
    It shows only the last five rows of DataFrame d. <br>
  * To see the last n rows, pass n as an argument: <br>
     <code> d.tail(n) </code>

In [None]:
d.tail() #last five rows

In [None]:
d.tail(9) # last 9 rows

## Simple Slicing in a DataFrame
 * Slicing works very much like in NumPy: the end index is excluded.
 * Suppose you want the 11 rows of the DataFrame from row 20 up to and including row 30: <br>
   <code> d[20<b>:</b>31] </code>

In [None]:
d[20:31]

In [None]:
#slicing with step size
d[20:30:2]

In [None]:
# negative indices count from the end of the DataFrame
d[-15:-10]
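
The three slices above can be checked on a small made-up DataFrame with 100 rows and the default integer index. This is only a sketch to make the row counts visible; the name `sample` and its single column are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows and the default index 0..99.
sample = pd.DataFrame({"value": range(100)})

rows_20_to_30 = sample[20:31]     # end index is excluded, so rows 20..30
every_second = sample[20:30:2]    # step size 2
from_the_end = sample[-15:-10]    # counted from the end of the DataFrame

print(len(rows_20_to_30))
print(every_second.index.tolist())
print(from_the_end.index.tolist())
```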

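### BUT: accessing a single row is different!
<code>d[3]</code> looks for a column labelled 3, not for the row at position 3, so it raises a KeyError on this DataFrame.

In [None]:
d[3]  # raises KeyError: there is no column named 3

In [None]:
# use iloc for position-based row access instead
d.iloc[3]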

In [None]:
#check with head
d.head()
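
The difference between column lookup, position-based access and label-based access can be sketched on a small made-up DataFrame. The index labels 10, 20, 30 are chosen deliberately so that positions and labels differ; all names here are hypothetical.

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions.
sample = pd.DataFrame(
    {"Summary": ["Partly Cloudy", "Mostly Cloudy", "Foggy"],
     "Humidity": [0.89, 0.83, 0.95]},
    index=[10, 20, 30],
)

try:
    sample[1]                      # interpreted as a column label
except KeyError:
    print("sample[1] looks for a column named 1")

row_by_position = sample.iloc[1]   # second row, by position
row_by_label = sample.loc[20]      # row with index label 20
single_value = sample.iloc[1, 1]   # second row, second column
```

`iloc` always counts positions starting at 0, while `loc` uses the index labels, so the two only agree when the index is the default 0, 1, 2, ...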

### Accessing a particular column of the DataFrame:
 * You can access a particular column by putting its name in quotes inside square brackets after the DataFrame d.<br>
    For example, to access the Humidity column: <br>
    
    <code> d['Humidity'] </code>

In [None]:
d['Humidity'].head(10)
# head() is applied here to show only the first 10 values, because the Humidity
# column has too many values to display
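
A related point that may be useful: a single column name returns a Series, while a list of column names returns a DataFrame. The example below is a sketch on made-up data.

```python
import pandas as pd

# Hypothetical stand-in for the weather data.
sample = pd.DataFrame({
    "Temperature (C)": [9.47, 9.36, 9.38],
    "Humidity": [0.89, 0.86, 0.89],
    "Summary": ["Partly Cloudy", "Partly Cloudy", "Mostly Cloudy"],
})

one_column = sample["Humidity"]                         # a Series
two_columns = sample[["Temperature (C)", "Humidity"]]   # a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.columns.tolist())
```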

### Finding min, max and average of a column:
  * <b>min() :</b> returns the minimum of a column <br>
  * <b>max() :</b> returns the maximum of a column <br>
  * <b>mean() :</b> returns the average of a column <br>
  
Find the minimum, maximum and average of the Humidity column:

In [None]:
d['Humidity'].min()
# minimum value of Humidity

In [None]:
d['Humidity'].max()
# maximum value of Humidity

In [None]:
d['Humidity'].mean()
# average value of Humidity
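
The three statistics can also be computed in one call with `agg()`, and `describe()` gives a broader summary of all numeric columns. The following is a sketch on made-up values, not on the weather data.

```python
import pandas as pd

# Hypothetical stand-in for the weather data.
sample = pd.DataFrame({
    "Temperature (C)": [9.5, 8.3, 7.7, 10.8],
    "Humidity": [0.89, 0.86, 0.83, 0.95],
})

# Several statistics of one column in a single call.
stats = sample["Humidity"].agg(["min", "max", "mean"])
print(stats)

# Count, mean, standard deviation, min, quartiles and max per numeric column.
summary = sample.describe()
print(summary)
```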

### Conditional statements
  * <code>d['your column'][your condition] </code> <br>
   It returns all values of the column in the rows where your condition holds true. The condition is a boolean expression on the DataFrame, not a quoted string.
#### Examples : <br>
  * Find the temperature when Humidity is minimum<br>
  
    <code> d['Temperature (C)'][d['Humidity']==d['Humidity'].min() ]</code> <br><br>
    
  * Find the temperature when Humidity is maximum <br>
  
    <code> d['Temperature (C)'][d['Humidity']==d['Humidity'].max() ] </code>

In [None]:
d['Temperature (C)'][d['Humidity']==d['Humidity'].min()]

In [None]:
d['Temperature (C)'][d['Humidity']==d['Humidity'].max()]

In the same way, you can apply various conditions to the data and analyse its hidden features.
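
As a sketch of how such conditions work, the example below stores the condition as a boolean mask and combines two conditions with `&`. It uses `loc` to select rows and a column in one step, which is an alternative to the chained brackets shown above; the data and thresholds are made up.

```python
import pandas as pd

# Hypothetical stand-in for the weather data.
sample = pd.DataFrame({
    "Temperature (C)": [9.5, 8.3, 7.7, 10.8],
    "Humidity": [0.89, 0.83, 0.95, 0.82],
})

# A condition evaluates to a boolean Series with one entry per row.
mask = sample["Humidity"] == sample["Humidity"].min()
temp_at_min_humidity = sample.loc[mask, "Temperature (C)"]
print(temp_at_min_humidity)

# Combine conditions with & (and) or | (or); each one needs its own parentheses.
humid_and_cold = sample[(sample["Humidity"] > 0.85) & (sample["Temperature (C)"] < 9.0)]
print(humid_and_cold)
```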

### Replacing NaN with a specific value
 * Here all NaN values are replaced with 0.
 * inplace=True modifies d itself instead of returning a new DataFrame.

In [None]:
d.fillna(0,inplace=True)
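
Before filling, it can help to count the missing values per column with `isna().sum()`. The sketch below also shows that `fillna()` accepts a dictionary with a different replacement per column; the data and the chosen replacement values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
sample = pd.DataFrame({
    "Precip Type": ["rain", None, "snow"],
    "Humidity": [0.89, float("nan"), 0.95],
})

missing_per_column = sample.isna().sum()   # number of missing values per column
print(missing_per_column)

# A different replacement value per column; without inplace=True a new
# DataFrame is returned and the original stays unchanged.
filled = sample.fillna({
    "Precip Type": "unknown",
    "Humidity": sample["Humidity"].mean(),
})
print(filled)
```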


### Visualization with Pandas
Pandas also has built-in visualization and plotting methods.

In [None]:
# example simple plot
import matplotlib.pyplot as plt
%matplotlib inline
d['Temperature (C)'].plot()


In [None]:
# histogram
d['Temperature (C)'].plot.hist()


In [None]:
# boxplot over all numerical variables
plt.figure(figsize=(20,10)) #pandas uses plt, so we can use plt methods to manipulate the output
d.boxplot()


In [None]:
#scatter plot
d.plot.scatter('Temperature (C)','Humidity',figsize=(10,5))

In [None]:
# finally, a really helpful tool to get a first look at the data - takes some time to compute on the full data, so only rows 1 to 99 are plotted here
output=pd.plotting.scatter_matrix(d[1:100],figsize=(20,15))
