
##Data Manipulation with Pandas

* Pandas is a package built on top of NumPy that provides an efficient implementation of a **DataFrame**.
* DataFrames are essentially two-dimensional arrays with attached row and column labels, often holding heterogeneous types and/or missing data.
* As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
* A DataFrame can be thought of as a dictionary of Series objects that share the same index.
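
The dictionary-of-Series view can be illustrated with a minimal sketch. The labels and values below are made up for illustration only.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
names = pd.Series(['Ali', 'Bilal'], index=['r1', 'r2'])
ages = pd.Series([28, 34], index=['r1', 'r2'])

# A dict of Series becomes a DataFrame: the dict keys become column labels
df = pd.DataFrame({'Name': names, 'Age': ages})

# Selecting a column gives back a Series that carries the shared index
age_column = df['Age']
print(type(age_column).__name__)   # Series
print(list(age_column.index))      # ['r1', 'r2']
```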

In [0]:
import pandas as pd

###Create DataFrame

A pandas DataFrame can be created using various inputs like:

* Lists
* Dictionaries
* Series
* NumPy ndarrays
* Another DataFrame
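
Lists, dictionaries, and Series are covered in the sections that follow. For the remaining two inputs, here is a minimal sketch; the column names 'x' and 'y' and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: each row of the array becomes a row of the DataFrame
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

# From another DataFrame: copy=True makes the new object independent of the source
df_copy = pd.DataFrame(df_from_array, copy=True)
df_copy.loc[0, 'x'] = 100

print(df_from_array.loc[0, 'x'])  # the original keeps its value
print(df_copy.loc[0, 'x'])        # only the copy was modified
```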


####Create DataFrame from Lists

In [5]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

   0
0  1
1  2
2  3
3  4
4  5


In [7]:
data = [['Ahmad',10],['Bilal',12],['Choiry',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

     Name  Age
0   Ahmad   10
1   Bilal   12
2  Choiry   13


####Create a DataFrame from Dict of ndarrays / Lists

All the **ndarrays** must be of the **same length**. If an index is passed, its length must equal the length of the arrays.

If no index is passed, the index defaults to range(n), where n is the array length.
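
These length rules can be checked directly. The sketch below assumes that pandas reports both kinds of mismatch as a ValueError; `raises_value_error` is a hypothetical helper written for this illustration, not part of pandas.

```python
import pandas as pd

def raises_value_error(build):
    """Hypothetical helper: return True if calling build() raises ValueError."""
    try:
        build()
    except ValueError:
        return True
    return False

data = {'Name': ['Ali', 'Bilal', 'Muadz'], 'Age': [28, 34, 29]}

# Without an explicit index, the row labels are 0 .. n-1
df_default = pd.DataFrame(data)
print(list(df_default.index))

# Arrays of different lengths are rejected
unequal_arrays = raises_value_error(
    lambda: pd.DataFrame({'Name': ['Ali', 'Bilal'], 'Age': [28]}))

# An index whose length differs from the arrays is rejected too
index_mismatch = raises_value_error(
    lambda: pd.DataFrame(data, index=['rank1', 'rank2']))

print(unequal_arrays, index_mismatch)
```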

In [10]:
data = {'Name':['Ali', 'Bilal', 'Muadz', 'Hasan'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

# Note: the values 0, 1, 2, 3 are the default index, generated as range(n).

    Name  Age
0    Ali   28
1  Bilal   34
2  Muadz   29
3  Hasan   42


Create an indexed DataFrame using arrays

In [11]:
data = {'Name':['Ali', 'Bilal', 'Muadz', 'Hasan'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

        Name  Age
rank1    Ali   28
rank2  Bilal   34
rank3  Muadz   29
rank4  Hasan   42


####Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are taken as column names.

In [12]:
# create a DataFrame by passing a list of dictionaries

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

# Note: NaN (Not a Number) fills the positions where a dictionary has no value for that key.

   a   b     c
0  1   2   NaN
1  5  10  20.0
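
As a follow-up sketch, the missing entries can be located with `isna`, and the `columns` parameter can choose which keys to keep and in what order. The column name 'z' is a hypothetical key that appears in none of the dictionaries.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

# True marks the rows where the 'c' key was missing
print(df['c'].isna().tolist())

# The column holding NaN is stored as float64, which is why 20 prints as 20.0
print(df['c'].dtype)

# columns= selects and orders the keys; 'z' (hypothetical) is in no dict, so it is all NaN
df_subset = pd.DataFrame(data, columns=['b', 'a', 'z'])
print(df_subset)
```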


####Create a DataFrame from Dict of Series
A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of all the Series indexes passed.



In [14]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

# Note: Series 'one' has no label 'd', so the result holds NaN for column 'one' in row 'd'.

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


####Column Selection

In [15]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
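
Selecting with a single label returns a Series, as the output above shows. Passing a list of labels returns a DataFrame instead. A minimal sketch on the same data:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

single = df['one']            # one label -> Series
several = df[['two', 'one']]  # list of labels -> DataFrame, in the requested order

print(type(single).__name__)
print(type(several).__name__)
print(list(several.columns))
```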


####Column Addition

In [16]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column to an existing DataFrame by assigning a Series to a new column label

print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


####Column Deletion

In [17]:
# Rebuild the previous DataFrame, then delete columns
# with the del statement and the pop method
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
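
Both `del` and `pop` modify the DataFrame in place. As a hedged alternative, the sketch below uses `drop(columns=...)`, which returns a new DataFrame, and also shows that `pop` hands back the removed column. The values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# drop returns a new DataFrame and leaves the original untouched
without_one = df.drop(columns=['one'])

# pop removes the column in place and returns it as a Series
popped = df.pop('two')

print(list(without_one.columns))  # ['two', 'three']
print(list(df.columns))           # ['one', 'three']
print(popped.tolist())            # [3, 4]
```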


####Row Selection, Addition, and Deletion

#####Selection by Label
Rows can be selected by passing a row label to the **loc** indexer.

In [18]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

one    2.0
two    2.0
Name: b, dtype: float64
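
Besides a single label, **loc** also accepts a list of labels or a row label paired with a column label. A short sketch on the same data:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

rows = df.loc[['a', 'c']]   # list of row labels -> DataFrame with those rows
value = df.loc['d', 'two']  # row label and column label -> single value

print(rows)
print(value)  # 4
```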


#####Selection by integer location
Rows can be selected by passing an integer position to the **iloc** indexer.

In [19]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

one    3.0
two    3.0
Name: c, dtype: float64
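
Integer positions follow the usual Python conventions. The sketch below shows a negative position, a row and column pair, and a position range on the same data.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

last_row = df.iloc[-1]    # negative positions count from the end
cell = df.iloc[1, 1]      # row position 1 ('b'), column position 1 ('two')
first_two = df.iloc[0:2]  # positions 0 and 1; the end position is excluded

print(last_row.name)
print(cell)
print(list(first_two.index))
```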


#####Slice Rows
Multiple rows can be selected using the ':' slice operator.

In [20]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

   one  two
c  3.0    3
d  NaN    4
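
One detail worth checking when slicing: a positional slice excludes its end point, while a label slice with **loc** includes it. A minimal sketch on the same data:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_position = df[2:4]        # positions 2 and 3 -> rows 'c' and 'd'
by_iloc = df.iloc[1:3]       # positions 1 and 2 -> rows 'b' and 'c'
by_label = df.loc['b':'c']   # labels 'b' through 'c', end label included

print(list(by_position.index))
print(list(by_iloc.index))
print(list(by_label.index))
```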


#####Addition of Rows
Add new rows to a DataFrame using the **pd.concat** function, which stacks the rows of the second DataFrame below the first. (The older **DataFrame.append** method was deprecated in pandas 1.4 and removed in pandas 2.0.)

In [21]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
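
The output above keeps the original row labels, so 0 and 1 each appear twice. If fresh labels are preferred, one option is `pd.concat` with `ignore_index=True`, sketched here with the same two frames:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old row labels and numbers the rows 0 .. n-1
combined = pd.concat([df, df2], ignore_index=True)

print(combined)
print(list(combined.index))  # [0, 1, 2, 3]
```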


#####Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, every row with that label is dropped.

In [22]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0

# Drop rows with label 0
df = df.drop(0)

print(df)

   a  b
1  3  4
1  7  8
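
To drop only one of the rows that share a label, one possible approach is to renumber the rows first with `reset_index(drop=True)`. The sketch below contrasts the two behaviours; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
stacked = pd.concat([df, df2])   # row labels are 0, 1, 0, 1

# Dropping label 0 removes both rows that carry it
both_removed = stacked.drop(0)

# After renumbering, label 0 refers to a single row
renumbered = stacked.reset_index(drop=True)
one_removed = renumbered.drop(0)

print(len(both_removed))  # 2
print(len(one_removed))   # 3
```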


###Read CSV Dataset

In [28]:
housing_df = pd.read_csv('/content/sample_data/california_housing_test.csv')
housing_df

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
| 1 | -118.30 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.5990 | 176500.0 |
| 2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
| 3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
| 4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | -119.86 | 34.42 | 23.0 | 1450.0 | 642.0 | 1258.0 | 607.0 | 1.1790 | 225000.0 |
| 2996 | -118.14 | 34.06 | 27.0 | 5257.0 | 1082.0 | 3496.0 | 1036.0 | 3.3906 | 237200.0 |
| 2997 | -119.70 | 36.30 | 10.0 | 956.0 | 201.0 | 693.0 | 220.0 | 2.2895 | 62000.0 |
| 2998 | -117.12 | 34.10 | 40.0 | 96.0 | 14.0 | 46.0 | 14.0 | 3.2708 | 162500.0 |


In [32]:
# View the first 5 rows
housing_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
| 1 | -118.3 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.599 | 176500.0 |
| 2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
| 3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
| 4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |
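
The path above only exists inside a Colab session. To try `read_csv` elsewhere, a small CSV can be read from an in-memory string. The sketch below reuses a few values from the table above; `csv_text` and `sample_df` are illustrative names.

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file on disk
csv_text = """longitude,latitude,housing_median_age,median_house_value
-122.05,37.37,27.0,344700.0
-118.30,34.26,43.0,176500.0
-117.81,33.78,27.0,270500.0
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (3, 4): rows, columns
print(sample_df.head(2))  # head(n) returns the first n rows
print(sample_df.tail(1))  # tail(n) returns the last n rows
```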


###Reading JSON

In [31]:
ans_df = pd.read_json('/content/sample_data/anscombe.json')
ans_df.head()

| | Series | X | Y |
|---|---|---|---|
| 0 | I | 10 | 8.04 |
| 1 | I | 8 | 6.95 |
| 2 | I | 13 | 7.58 |
| 3 | I | 9 | 8.81 |
| 4 | I | 11 | 8.33 |
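
The same idea works for JSON without the Colab sample file. This sketch assumes the data is a list of records (one JSON object per row) and passes `orient='records'` explicitly; the two records reuse values from the table above.

```python
import io
import pandas as pd

# A list of JSON records, one object per row
json_text = '[{"Series": "I", "X": 10, "Y": 8.04}, {"Series": "I", "X": 8, "Y": 6.95}]'

ans_sample = pd.read_json(io.StringIO(json_text), orient='records')

print(ans_sample)
print(list(ans_sample.columns))
```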

