# Module 2: Practical Tools

## Lecture 1: PANDAS

# Anaconda - Open-source Python/R Distribution for Scientific Computing
#### 1. Includes Python & R,  many packages such as pandas, numpy, scikit-learn, tensorflow
To install Anaconda, refer to the tutorial at: 
https://www.datacamp.com/community/tutorials/installing-anaconda-windows
#### 2. Includes the conda package, dependency and environment manager for many languages 
#### 3. Scientific Computing: Data Science, Machine Learning, Deep Learning, Statistical Analysis
#### 4. Platform agnostic - Windows, MacOS, Linux


## Jupyter Notebook is a web app for creating and sharing computational documents. 
#### It is pre-installed in Anaconda.
#### Jupyter Notebook - live code, textual descriptions, equations, visualization
#### Launch Jupyter Notebook from Anaconda


## Introduction to Pandas Library

pandas - a powerful data analysis and manipulation library for Python
=============================================================

1. **pandas** is derived from **PAN**el **DA**ta
2. It is a Python package providing expressive data structures designed to make working with "relational" and "labeled" data both easy and intuitive. 
3. It can:
    - Load, merge or segregate data - Series, DataFrames, JSON, CSV
    - Clean data - complete missing values, remove duplicates, reformat 
    - Manipulate data, perform data analytics, provide time-series functions
    - Analyse, correlate and Visualize data
4. Let's get started. Import the pandas library using the "import pandas as pd" command.

In [None]:
import pandas as pd
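
To make the capabilities listed above a little more concrete, here is a minimal sketch of the "clean data" step. The tiny table and its column names (`city`, `temp`) are made up for illustration, and filling with the column mean is just one possible choice.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row and one missing value.
raw = pd.DataFrame(
    {"city": ["Pune", "Pune", "Delhi"], "temp": [31.0, 31.0, None]}
)

# Remove exact duplicate rows (the first occurrence is kept).
deduped = raw.drop_duplicates()

# Complete the missing value with the mean of the remaining values.
clean = deduped.fillna({"temp": deduped["temp"].mean()})
print(clean)
```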

## 1) Pandas Series

* Series is a **one-dimensional**, labeled and ordered array capable of holding any (uniform) data type (integers, strings, floating point numbers, python objects etc.). 

* The axis labels are collectively referred to as the index. 

* The basic method to create a Series is to call :-
    
    pd.Series(data,index = index)
    
    The passed **index** is a list of axis labels.
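
As a quick sketch of the call above: if no index is passed, pandas falls back to integer labels starting at 0. The variable names here are illustrative only.

```python
import pandas as pd

# No index passed: pandas assigns the default integer labels 0, 1, 2, ...
s_default = pd.Series([9.6, 6.2, 3.2])
print(s_default.index.tolist())

# Explicit index: one label per value.
s_labeled = pd.Series([9.6, 6.2, 3.2], index=["a", "b", "c"])
print(s_labeled["b"])
```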

#### Creating Series :-
    
    1) From Scalar Value 
    
    2) From Dictionary

## 1.1) Scalar Series 

***Specify both data and index; with list data, the two must have the same length***

In [None]:
s1 = pd.Series(data = [9.6,6.2,3.2,1.1,2.3,1], index = ['a','b','c','d','e','f'])
print(s1)

***If a single scalar value is given as data, it is repeated to match the index length***

In [None]:
s2 = pd.Series(data = 9.6, index = ['a','b','c','d','e','f'])
print(s2)

#### A series is like a vector. Vectorized operations with Series:

In [None]:
s1+s1

In [None]:
s2**2
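
One detail worth knowing about vectorized operations: pandas pairs values by index label, not by position. The sketch below uses two made-up Series whose labels only partly overlap; labels present on one side only come out as NaN.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition pairs values by label; "x" and "w" exist on one side only.
total = a + b
print(total)
```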

## 1.2. Series from a Dictionary

***When the data is a dictionary {key: value}, the values become the members of the Series and the keys become its index.***

*When an index is not passed, the Series index follows the dictionary's insertion order.*

In [None]:
dict_data = {'a':7,'b':5,'c':9,'d':10,'e':11}

In [None]:
Dser1 = pd.Series(dict_data)
print(Dser1)


In [None]:
type(Dser1)

*If an index is passed as a list of labels, the dictionary values corresponding to those labels are pulled out, in the order given.*

In [None]:
Dser2 = pd.Series(dict_data,index = ['b','e','a','d','c'])
print(Dser2)

In [None]:
# A label that is not a key in the dictionary produces a NaN (Not a Number) entry

Dser3 = pd.Series(dict_data,index = ['b','e','a','d','c', 'f'])
print(Dser3)

***You can get and set values by referring to an index label in any Series***

In [None]:
Dser3['a']

In [None]:
Dser3['f'] = 98
print(Dser3)
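
Looking up a label that does not exist raises a `KeyError`. A small sketch of two ways to guard against that, using a made-up Series `s`:

```python
import pandas as pd

s = pd.Series({"a": 7, "b": 5})

# Membership tests look at the index labels, not the values.
print("a" in s, "z" in s)

# get() returns a default instead of raising KeyError for a missing label.
print(s.get("z", -1))

# Assigning to a new label appends an entry.
s["z"] = 98
print(s["z"])
```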

## 2. Pandas DataFrames

### 2.1) Creating DataFrames

A DataFrame is a **2-dimensional** labeled data structure with columns of potentially different types. It is just like a spreadsheet or SQL table, or a dict of Series objects. 

***Method of creating a DataFrame***

d = { 
        "column_name" : pd.Series(data,index = ["row_name"])
    }

df = pd.DataFrame(d)

***The row and column labels can be accessed through the index and columns attributes, respectively.***

#### a) Creating a Dataframe from dict of series 

In [None]:
d = {
        "X": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
        "Y": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
    }

In [None]:
df = pd.DataFrame(d)
df

In [None]:
type(df)

#### For enumerating columns of Dataframe

In [None]:
df.columns

#### For enumerating row indices of Dataframe

In [None]:
df.index

##### For enumerating both axes (rows first, then columns) - DataFrame.axes



In [None]:
df.axes

#### For accessing a column

In [None]:
df['Y']

#### For accessing an element

In [None]:
df['X']['c']
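
The chained form above works for reading. As an alternative sketch, `.loc` and `.at` take the row label and column label in a single step, which is also the safer way to assign a value. The small frame `df_demo` is made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"X": [1.0, 2.0, 3.0], "Y": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Label-based access: row label first, then column label.
print(df_demo.loc["c", "X"])

# .at is the scalar-only counterpart of .loc.
print(df_demo.at["c", "X"])

# Assignment through .loc writes to the DataFrame itself.
df_demo.loc["c", "X"] = 30.0
```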

#### b) Creating a Dataframe from a list of dictionaries

In [None]:
data = [{'a' : 10,'b' : 20 , 'c' : 25},{'a' : 11 , 'b' : 18}]

In [None]:
df = pd.DataFrame(data)
df

#### c) Creating a DataFrame from a list of dictionaries with an explicit index. Note that the index length must match the number of dictionaries

In [None]:
df = pd.DataFrame(data, index = ['first','second'])
df
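
A short sketch of what happens when the index length does not match: pandas refuses to build the DataFrame. The list `records` repeats the data used above, and the extra label `"third"` is added purely to trigger the error.

```python
import pandas as pd

records = [{"a": 10, "b": 20, "c": 25}, {"a": 11, "b": 18}]

try:
    # Three labels for two dictionaries: the lengths do not match.
    pd.DataFrame(records, index=["first", "second", "third"])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```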

### 2.2) DataFrame Basic functionalities 

In [None]:
df

##### a) Transpose - DataFrame.T

It returns the transpose of the dataframe.

In [None]:
df.T

#### b) DataFrame.dtypes
It returns the data type of each column.

In [None]:
df.dtypes
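
The dtypes of this particular frame show a common side effect of missing data. In the sketch below (rebuilding the same data under the hypothetical name `df_demo`), the second record has no `c`, so that cell becomes NaN and the whole column is stored as a floating-point type.

```python
import pandas as pd

records = [{"a": 10, "b": 20, "c": 25}, {"a": 11, "b": 18}]
df_demo = pd.DataFrame(records, index=["first", "second"])

# "c" is missing from the second record, so it is stored as NaN,
# and the column is held as float64 rather than int64.
print(df_demo.dtypes)
print(df_demo["c"].isna().tolist())
```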

##### c) DataFrame.empty
It returns a boolean indicating whether the object is empty. **True** indicates that the object is empty.

In [None]:
df.empty

##### d) DataFrame.shape
It returns a tuple representing the dimensions of the DataFrame, in the form (no_of_rows, no_of_columns).

In [None]:
df.shape

##### e) DataFrame.size
It returns the number of elements in the DataFrame (rows multiplied by columns).

In [None]:
df

In [None]:
df.size

#### f) DataFrame.values
It returns the actual data in the DataFrame as a NumPy array.

In [None]:
df.values
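
The three attributes above fit together, as this sketch on a made-up frame shows: `size` is the product of the two numbers in `shape`, and the array form has the same shape as the DataFrame. `to_numpy()` is used here as the method counterpart of `.values`.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [10, 11], "b": [20, 18], "c": [25.0, None]})

rows, cols = df_demo.shape
print(rows, cols, df_demo.size)   # size counts every cell, NaN included

arr = df_demo.to_numpy()          # the data as a NumPy array
print(arr.shape)
```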

## 3. Pandas :- file read and file write operations

#### a) Read CSV file

In [None]:
df = pd.read_csv("workingfile.csv")
df
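
If "workingfile.csv" is not at hand, the same call can be tried on in-memory text. The sketch below uses made-up CSV content, and also shows the write direction with `to_csv`, which returns the CSV text when no path is given.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nAsha,81\nRavi,74\n"

# read_csv accepts a path, a URL, or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Write it back out as CSV text; index=False leaves out the row labels.
written = df_csv.to_csv(index=False)
print(written)
```

To write to disk instead, pass a file name as the first argument, for example `df_csv.to_csv("out.csv", index=False)` (the file name is only an example).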

#### b) Read HTML file

In [None]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
df = pd.read_html(url)
df

In [None]:
df = df[0]   # read_html() returns a list of DataFrames; take the first one
df.head() # Show the top 5 rows

In [None]:
df.tail() # Show the bottom 5 rows

#### c) Read TSV file

In [None]:
df = pd.read_table("test.tsv")
df
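
`read_table` uses a tab as its default separator, so it behaves like `read_csv` with `sep="\t"`. A small check on made-up in-memory text:

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for test.tsv.
tsv_text = "name\tscore\nAsha\t81\nRavi\t74\n"

from_table = pd.read_table(io.StringIO(tsv_text))
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls parse the text into the same DataFrame.
print(from_table.equals(from_csv))
```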

#### d) Read JSON file

In [None]:
df = pd.read_json("example.json")
df
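
For a sense of what `read_json` expects, here is a sketch with a made-up JSON string holding a list of records; each record becomes one row. The layout of "example.json" itself is not known, so this is only one of the shapes the function accepts.

```python
import io
import pandas as pd

# Hypothetical JSON: a list of records, one object per row.
json_text = '[{"name": "Asha", "score": 81}, {"name": "Ravi", "score": 74}]'

df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```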

#### e) Read Excel file

In [None]:
df = pd.read_excel("ex1.xlsx")
#df = pd.read_excel("NFHS_5_India_Districts_Factsheet_Data.xls")
df

In [None]:
df.dtypes


#### f) Read data from a GitHub repository 

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data

In [None]:
data.head() #read top 5 rows

In [None]:
data.tail() #read bottom 5 rows

## 4. Initial Data Analysis

In [None]:
## Import dataset 
data = pd.read_csv("data_gov.csv")
data.head() # Show the top five rows

In [None]:
# Show bottom five rows
data.tail()

In [None]:
## shape of dataset
data.shape

There are 36 rows and 12 columns in our dataset.

In [None]:
## size of data set
data.size

There are 432 elements (36 rows × 12 columns) in our dataset.

In [None]:
### Checking the data type of columns
data.dtypes

In [None]:
#### Fetch Information from dataset
data.info()

In [None]:
## Fetching data for a specific column
data["Category of States"]

In [None]:
### Collecting the unique values of a specific column
data["Category of States"].unique()
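
Alongside `unique()`, two related methods are often handy: `nunique()` for the number of distinct values and `value_counts()` for how often each one occurs. The category names below are made up and are not taken from data_gov.csv.

```python
import pandas as pd

# Hypothetical categories, for illustration only.
col = pd.Series(["Special", "General", "General", "UT", "General"])

print(col.unique())        # distinct values in order of first appearance
print(col.nunique())       # number of distinct values
print(col.value_counts())  # how often each value occurs
```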

In [None]:
### fetching 'Scheduled Castes - 2004-05', 'Scheduled Castes - 2007-08', 'Scheduled Tribes - 2004-05', 'Scheduled Tribes - 2007-08' from dataset
data[['Scheduled Castes - 2004-05', 'Scheduled Castes - 2007-08','Scheduled Tribes - 2004-05', 'Scheduled Tribes - 2007-08']]

## Statistical Analysis of dataset using pandas

In [None]:
## mean of "Scheduled Castes - 2004-05" column
data["Scheduled Castes - 2004-05"].mean()

In [None]:
## Variance of Scheduled Castes - 2004-05 column
data["Scheduled Castes - 2004-05"].var()
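
Note that `var()` computes the sample variance by default, dividing by n - 1 rather than n. A small worked sketch on made-up numbers, where the squared deviations from the mean (4.0) add up to 8:

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])

sample_var = values.var()            # divides by n - 1 (ddof=1): 8 / 3
population_var = values.var(ddof=0)  # divides by n: 8 / 4
print(sample_var, population_var)

# std() is the square root of the variance (same default ddof).
print(values.std())
```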

In [None]:
## maximum value of Scheduled Castes - 2004-05 column
data["Scheduled Castes - 2004-05"].max()

In [None]:
## minimum value of Scheduled Castes - 2004-05 column
data["Scheduled Castes - 2004-05"].min()

#### To get summary statistics for the complete dataset

The statistics that are generated by the describe() method:

1) count tells us the number of non-missing (non-NaN) values in a feature.

2) mean tells us the mean value of that feature.

3) std tells us the Standard Deviation Value of that feature.

4) min tells us the minimum value of that feature.

5) 25%, 50%, and 75% are the percentiles (quartiles) of each feature; 50% is the median. This quartile information helps us detect outliers.

6) max tells us the maximum value of that feature.
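
One common way to use the quartiles for outlier detection is the 1.5 × IQR rule of thumb. It is sketched below on made-up numbers; it is a convention, not something `describe()` does by itself.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)   # the 25% row of describe()
q3 = values.quantile(0.75)   # the 75% row of describe()
iqr = q3 - q1                # interquartile range

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```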

In [None]:
data.describe()

In [None]:
data.describe(include=['O'])