## Manipulating data using pandas

# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis.
Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
You can think of pandas as an extremely powerful version of Excel, with a lot more features. In this section of the course, you should go through the notebooks in this order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

First, you need the pandas library, which can be installed with this command:

In [None]:
!pip install pandas

Import the pandas library with this command:

In [None]:
import pandas as pd

## Series

The first main data type we will learn about for pandas is the Series data type. 

A Series is very similar to a NumPy array (it is built on top of the NumPy array object). 

What differentiates a Series from a NumPy array?

1) A Series can have axis labels, meaning it can be indexed by a label instead of just a number location.

2) A Series doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [None]:
import numpy as np
import pandas as pd

You can convert a list, NumPy array, or dictionary to a Series:

In [None]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

In [None]:
pd.Series(data=my_list)

In [None]:
pd.Series(data=my_list,index=labels)

NumPy Arrays

In [None]:
pd.Series(arr)

In [None]:
pd.Series(arr,labels)

Dictionary

In [None]:
pd.Series(d)

## Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information.

In [None]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])

In [None]:
ser2 = pd.Series([6,7,8,9],index = ['USA', 'Germany','Italy', 'Japan'])

In [None]:
ser1

In [None]:
ser2

In [None]:
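ser1.iloc[0]   # first element by position; ser1[0] is deprecated when the index holds labels

In [None]:
ser2.iloc[3]   # fourth element by position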

Operations between Series are also aligned on the index:

In [None]:
ser1 + ser2
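
The addition above matches values by index label before adding them. As a minimal sketch (the same data re-created under the hypothetical names `s1` and `s2`, so nothing in the notebook is overwritten), labels that appear in only one Series come out as `NaN`, and `add` with `fill_value` is one way to avoid that:

```python
import pandas as pd

# Same data as ser1 and ser2, under new names for this illustration
s1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
s2 = pd.Series([6, 7, 8, 9], index=['USA', 'Germany', 'Italy', 'Japan'])

total = s1 + s2   # values are matched by label, not by position
print(total)      # 'USSR' and 'Italy' exist in only one Series, so they are NaN

# fill_value=0 treats a label missing from one Series as 0 instead
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

Because `NaN` is a floating-point value, the result is stored as floats even though both inputs held integers.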

## DataFrames

One basic structure that you get with pandas is the DataFrame. A DataFrame is a two-dimensional grid, rather similar to a relational database table, except that it lives in memory.

DataFrames are the workhorse of pandas and are directly inspired by the R programming language.

We can think of a DataFrame as a bunch of Series objects put together to share the same index.
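
To make the "Series sharing an index" idea concrete, here is a small sketch that builds a DataFrame from two Series. The names `w`, `x` and `demo` and the values are made up for illustration:

```python
import pandas as pd

# Two hypothetical Series that share the index labels 'A', 'B', 'C'
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dictionary key becomes a column name; the shared index becomes the row labels
demo = pd.DataFrame({'W': w, 'X': x})
print(demo)

# Pulling a column back out returns a Series with the same index
print(type(demo['W']))
```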

In [None]:
from numpy.random import randn
np.random.seed(101)

In [None]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [None]:
df

## Selection and Indexing

How to grab data from a DataFrame

In [None]:
df['W']

In [None]:
# Pass a list of column names
df[['W','Z']]

DataFrame Columns are just Series

In [None]:
type(df['W'])

**Creating a new column:**

In [None]:
df['new'] = df['W'] + df['Y']

In [None]:
df

**Removing Columns**

In [None]:
df.drop('new',axis=1) 

axis=1 refers to columns. Note that drop() returns a new DataFrame, so the 'new' column has not been permanently removed from df:

In [None]:
df

Not inplace unless specified!

In [None]:
df.drop('new',axis=1,inplace=True)

In [None]:
df
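
As an alternative to `inplace=True`, the result of `drop` can be kept under a new name. The sketch below uses a small hypothetical frame (`demo`, with made-up values) to show that the original is untouched:

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'W': [1, 2], 'Y': [3, 4], 'new': [4, 6]}, index=['A', 'B'])

trimmed = demo.drop('new', axis=1)   # returns a new DataFrame
print(demo.columns.tolist())         # the original still has 'new'
print(trimmed.columns.tolist())      # the returned frame does not

# Equivalent spelling that names the axis through the keyword
trimmed_again = demo.drop(columns=['new'])
```

Keeping the result under a new name leaves the original frame available for later cells.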

**Removing Rows**

In [None]:
df.drop('E',axis=0)

**Selecting Rows**

In [None]:
df.loc['A']

Or you can select by position instead of label:

In [None]:
df.iloc[2]

**Selecting a subset of rows and columns**

In [None]:
df.loc['B','Y']

In [None]:
df.loc[['A','B'],['W','Y']]

**Example 1**

In [None]:
# Dictionary format: {column name: [list of values]}

data = {'state': ['Jakarta', 'Jakarta', 'Jakarta', 'Selangor', 'Selangor','Kelantan','Kelantan'],
       'year': [2000, 2001, 2002, 2001, 2002, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 1.2, 3.2]}
frame = pd.DataFrame(data)
frame

In [None]:
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])
frame

**Adding new column**

In [None]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six','seven'])
frame2

**Selecting column**

In [None]:
frame2.columns

In [None]:
frame2['state']

In [None]:
frame2.year

In [None]:
frame2.loc['two'] 

loc stands for location and selects by label: frame2.loc['two'] lists every element in the row labelled 'two'.

In [None]:
frame2.loc['two','state'] 

In [None]:
# iloc selects by integer position (row 1, column 1)

frame2.iloc[1,1]

In [None]:
frame2.loc['two':,:'state']
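
One detail worth checking in the slice above: label slices with `loc` include the end label, while position slices with `iloc` stop before the end position. A minimal sketch on a hypothetical three-row frame (`demo`, a cut-down version of the example data):

```python
import pandas as pd

demo = pd.DataFrame(
    {'year': [2000, 2001, 2002],
     'state': ['Jakarta', 'Jakarta', 'Selangor'],
     'pop': [1.5, 1.7, 2.4]},
    index=['one', 'two', 'three'],
)

by_label = demo.loc['two':, :'state']   # label slice: the 'state' column is included
by_position = demo.iloc[1:, :2]         # position slice: column 2 ('pop') is excluded

print(by_label)
print(by_position)
```

Both selections return the same rows and columns here, which is why `:'state'` corresponds to `:2` rather than `:1`.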

In [None]:
frame2

**The debt value is NaN. We can assign values to 'debt':**

In [None]:
frame2['debt'] = 16.5
frame2

In [None]:
frame2['debt'] = np.arange(7.)
frame2

In [None]:
frame2['debt'] = [10,20,15,13,11,67,87]
frame2

In [None]:
'Jakarta' in frame2.columns 

# Use Case Exercise

In [None]:
import numpy as np
import pandas as pd

**DATA 1**

Load data from the local drive (downloaded files)

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df = pd.read_csv(io.BytesIO(data_to_load['data2.csv']))

**Checking the first 5 and last 5 rows**

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
pd.isnull(df)
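
`pd.isnull(df)` returns a True/False table with the same shape as the data, which is hard to scan for larger files. A common follow-up is to count the missing values per column. The sketch below uses a hypothetical frame with made-up values, not the contents of `data2.csv`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'COUNTRY': ['X', 'Y', 'Z'],
    'CONTINENTS': ['Asia', np.nan, 'Europe'],
    'IND_DAY': [np.nan, np.nan, '1947-08-15'],
})

missing_per_column = demo.isna().sum()   # isna performs the same check as pd.isnull
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = demo[demo.isna().any(axis=1)]
print(rows_with_gaps)
```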

In [None]:
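# .copy() makes newdf1 independent of df, so later assignments do not raise SettingWithCopyWarning
newdf1=df[['COUNTRY','POPULATION','CONTINENTS','IND_DAY']].copy()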
newdf1

In [None]:
newdf1['CONTINENTS'].fillna('transcontinental')

In [None]:
newdf1

No changes for Russia: fillna() returns a new Series and leaves newdf1 untouched.

In [None]:
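# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
newdf1['CONTINENTS'] = newdf1['CONTINENTS'].fillna('Transcontinental')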
newdf1
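
The difference between calling `fillna` and keeping its result can be checked on a small hypothetical frame (`source` and `subset` are illustrative names with made-up values). The sketch also takes an explicit `.copy()` of the column subset, on the assumption that the original frame should stay unchanged:

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({
    'COUNTRY': ['X', 'Y'],
    'CONTINENTS': ['Asia', np.nan],
    'AREA': [1.0, 2.0],
})

subset = source[['COUNTRY', 'CONTINENTS']].copy()   # independent copy of two columns

subset['CONTINENTS'].fillna('Transcontinental')     # result discarded, subset unchanged
still_missing = subset['CONTINENTS'].isna().sum()

subset['CONTINENTS'] = subset['CONTINENTS'].fillna('Transcontinental')  # result kept
now_missing = subset['CONTINENTS'].isna().sum()

print(still_missing, now_missing)
```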

In [None]:
newdf1.dtypes

**How to convert the date column to datetime**

In [None]:
newdf1['IND_DAY']=pd.to_datetime(newdf1['IND_DAY'])
newdf1

In [None]:
newdf1.dtypes

**How to fill in missing dates**

In [None]:
newdf1['IND_DAY'].fillna(pd.Timestamp("20210423"))

In [None]:
newdf1['IND_DAY'].astype(str).replace({'NaT': "No date"})

In [None]:
newdf1['IND_DAY'].fillna(value = 'No date')

In [None]:
# Assign the result back instead of using inplace=True on a selected column
newdf1['IND_DAY'] = newdf1['IND_DAY'].fillna(value = 'No date')
newdf1
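
The two ways of filling missing dates have different effects on the column type. The sketch below, on a hypothetical Series called `dates`, keeps the datetime type when filling with a Timestamp, and formats the dates as text before labelling the gaps. The `'%Y-%m-%d'` format is an assumption chosen for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1947-08-15', None, '1991-12-26']))

# Option 1: fill with a Timestamp, so the column stays a datetime column
filled_ts = dates.fillna(pd.Timestamp('2021-04-23'))
print(filled_ts.dtype)

# Option 2: format the dates as text first, then label the gaps
as_text = dates.dt.strftime('%Y-%m-%d').fillna('No date')
print(as_text.tolist())
```

Option 1 keeps date arithmetic and sorting available; Option 2 is better suited to a column meant only for display.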

References:<br>
[1] https://stackoverflow.com/questions/42818262/pandas-dataframe-replace-nat-with-none<br>
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html<br>
[3] https://stackoverflow.com/questions/32327314/how-to-rearrange-a-date-in-python

**Add a new dataset called 'data3'**

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df1 = pd.read_csv(io.BytesIO(data_to_load['data3.csv']), encoding='latin-1')

In [None]:
#df1 = pd.read_csv('data3.csv')
df1

In [None]:
df1.dtypes

In [None]:
df1.shape

**Change Data Type** <br>
Change cases and deaths to float

In [None]:
df1['deaths'] = df1['deaths'].str.replace(',','')
df1['deaths'] = df1.deaths.astype(float)
df1['cases'] = df1['cases'].str.replace(',','')
df1['cases'] = df1.cases.astype(float)

In [None]:
df1.dtypes
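
If a column may also contain text that is not a number, `astype(float)` raises an error. One hedged alternative is `pd.to_numeric` with `errors='coerce'`, sketched here on a hypothetical frame (`raw`, with made-up values including an `'n/a'` entry):

```python
import pandas as pd

raw = pd.DataFrame({'cases': ['1,234', '56', 'n/a'], 'deaths': ['12', '3', '0']})

for col in ['cases', 'deaths']:
    cleaned = raw[col].str.replace(',', '', regex=False)   # drop thousands separators
    raw[col] = pd.to_numeric(cleaned, errors='coerce')     # unparseable text becomes NaN

print(raw)
print(raw.dtypes)
```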

In [None]:
df.head(2)

In [None]:
df1.head(2)

**Setting the header to a consistent case**

The two datasets write their headers differently.
Let's change that.

In [None]:
# .capitalize() makes the first letter uppercase and the rest lowercase
df1.columns=df1.columns.str.capitalize()  
df1.head(2)

In [None]:
# .upper() changes the header to uppercase
df1.columns=df1.columns.str.upper()
df1.head(2)
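
Matching header names matters because the merge step relies on columns that have the same name in both frames. A small sketch with two hypothetical frames (`left` and `right`, whose headers differ in case and stray spaces) shows one way to normalise them and check which names are shared:

```python
import pandas as pd

left = pd.DataFrame({'COUNTRY': ['X'], 'POPULATION': [1.0]})
right = pd.DataFrame({' country': ['X'], 'Cases ': [10.0]})

# Strip surrounding spaces and upper-case the header names
right.columns = right.columns.str.strip().str.upper()

# Column names present in both frames
shared = left.columns.intersection(right.columns)
print(list(shared))
```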

In [None]:
df1.shape

In [None]:
df.shape

## Merge 

Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects.

**MERGE** combines data on common columns or indices.

You can achieve both many-to-one and many-to-many joins with merge().

![Mergeconcept](https://files.realpython.com/media/join_diagram.93e6ef63afbe.png)

The how argument of merge() controls which rows are kept when a key does not appear in both DataFrames. The two most common choices are:

* Take the union of the keys, how='outer'. This results in zero information loss.
* Take the intersection of the keys, how='inner'. This is the default for merge().

For concat(), the equivalent argument is join, and its default is join='outer'.

In [None]:
merge_df=pd.merge(df, df1)
merge_df

By default, how='inner', which keeps only the rows whose keys match in both DataFrames.
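
When no key is given, `merge` joins on the columns whose names appear in both frames. The sketch below uses two hypothetical frames (`people` and `health`, made-up values) to show that spelling out the key with `on=` gives the same result:

```python
import pandas as pd

people = pd.DataFrame({'COUNTRY': ['X', 'Y', 'Z'], 'POPULATION': [10, 20, 30]})
health = pd.DataFrame({'COUNTRY': ['Y', 'Z', 'W'], 'CASES': [5, 6, 7]})

implicit = pd.merge(people, health)                              # key inferred: 'COUNTRY'
explicit = pd.merge(people, health, on='COUNTRY', how='inner')   # same join, spelled out

print(explicit)
```

Naming the key explicitly is a design choice that protects against an accidental join on an unrelated column that happens to share a name.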

In [None]:
merge_df.shape

In [None]:
merge_df.CONTINENTS=merge_df.CONTINENTS.replace(['N.America','S.America'],['North America','South America'])
merge_df

In [None]:
test1merge_df=pd.merge(df, df1, how='inner')
test1merge_df

In [None]:
test1merge_df.shape

In [None]:
merge_df=pd.merge(df, df1, how='outer')
merge_df
# how='outer' keeps all rows from both DataFrames

In [None]:
merge_df.shape

In [None]:
# Let's try how='left' or 'right'

test1=pd.merge(df, df1, how='left')
test1

how='left' keeps every row of the left DataFrame, which in this example is df.

In [None]:
test2=pd.merge(df, df1, how='right')
test2

how='right' keeps every row of the right DataFrame, which in this example is df1.
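
The four `how` options can be compared side by side by counting the rows each one returns. This is a sketch on two hypothetical frames (`people` and `health`, made-up values); `indicator=True` adds a `_merge` column that records where each row came from:

```python
import pandas as pd

people = pd.DataFrame({'COUNTRY': ['X', 'Y', 'Z'], 'POPULATION': [10, 20, 30]})
health = pd.DataFrame({'COUNTRY': ['Y', 'Z', 'W'], 'CASES': [5, 6, 7]})

# Number of rows returned by each join type
rows = {}
for how in ['inner', 'outer', 'left', 'right']:
    rows[how] = len(pd.merge(people, health, on='COUNTRY', how=how))
print(rows)

# _merge is 'both', 'left_only' or 'right_only' for each row
tagged = pd.merge(people, health, on='COUNTRY', how='outer', indicator=True)
print(tagged)
```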

## Concatenating

With concatenation, your datasets are just stitched together along an axis — either the row axis or column axis.

In [None]:
concat_df=pd.concat([df, df1], axis=1)
concat_df
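
Note that `concat` with `axis=1` lines rows up by their index labels, not by a key column such as a country name. A minimal sketch with two hypothetical frames (`a` and `b`) whose indexes only partly overlap:

```python
import pandas as pd

a = pd.DataFrame({'COUNTRY': ['X', 'Y']}, index=[0, 1])
b = pd.DataFrame({'CASES': [5, 6]}, index=[1, 2])

side_by_side = pd.concat([a, b], axis=1)            # default join='outer': labels 0, 1, 2
print(side_by_side)

overlap = pd.concat([a, b], axis=1, join='inner')   # only label 1 is in both
print(overlap)
```

If rows should be matched on a column value instead, `merge` is usually the better fit.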

Let's try appending the data.

Let's add a new dataset called data4.

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df2 = pd.read_csv(io.BytesIO(data_to_load['data4.csv']))
df2

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
test3 = pd.concat([df, df2], ignore_index=True, sort=False)
test3

The data from df and data4 (df2) are combined at row level.
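
What row-level stacking does with columns that exist in only one frame can be seen on a small example. The frames `first` and `second` below are hypothetical, with made-up values:

```python
import pandas as pd

first = pd.DataFrame({'COUNTRY': ['X', 'Y'], 'POPULATION': [10, 20]})
second = pd.DataFrame({'COUNTRY': ['Z'], 'AREA': [3.5]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own labels
stacked = pd.concat([first, second], ignore_index=True, sort=False)
print(stacked)
```

Columns missing from one of the inputs are kept and filled with `NaN` for the rows that came from that input.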

**LET'S MOVE TO GROUPBY**

In [None]:
df.head(2)

In [None]:
df1.head(2)

In [None]:
df2.head()

In [None]:
merge_df.head()

In [None]:
merge_df.shape

In [None]:
merge_df.dtypes

**CONTINENTS and REGION actually refer to the same thing.**
Let's try to fix it.

Checking the elements:

In [None]:
reg=merge_df.groupby('REGION').sum(numeric_only=True)  # sum only the numeric columns
reg

In [None]:
con=merge_df.groupby('CONTINENTS').sum(numeric_only=True)  # sum only the numeric columns
con

In [None]:
merge_df.CONTINENTS=merge_df.CONTINENTS.replace(['N.America','S.America'],['North America','South America'])
merge_df

In [None]:
reg=merge_df.groupby('REGION').sum(numeric_only=True)
reg

In [None]:
con=merge_df.groupby('CONTINENTS').sum(numeric_only=True)
con

In [None]:
df = df.rename(columns={'CONTINENTS': 'REGION'})
df.head()

In [None]:
df1.head()

In [None]:
df101=pd.merge(df, df1)
df101

In [None]:
reg=df101.groupby('REGION').sum(numeric_only=True)
reg
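
Beyond a plain sum, `groupby` can compute a different summary for each column in one call. The sketch below runs on a hypothetical frame (`demo`, with made-up numbers, not the merged course data); the output names `total_cases` and `countries` are illustrative:

```python
import pandas as pd

demo = pd.DataFrame({
    'REGION': ['Asia', 'Asia', 'Europe'],
    'COUNTRY': ['X', 'Y', 'Z'],
    'CASES': [10.0, 20.0, 5.0],
    'DEATHS': [1.0, 2.0, 0.0],
})

# numeric_only=True skips text columns such as COUNTRY
totals = demo.groupby('REGION').sum(numeric_only=True)
print(totals)

# Named aggregation: output column = (input column, function)
summary = demo.groupby('REGION').agg(
    total_cases=('CASES', 'sum'),
    countries=('COUNTRY', 'count'),
)
print(summary)
```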