# Data Manipulation using Pandas

Author: Alamsyah Hanza \
[Github](https://github.com/alamhanz)

## Contents

Day 3
- Sorting
- Grouping
- Pandas Apply and Map Function
- Appending, Joining, Merging, and Concatenating two or more DataFrames
- Pivot and Stack

## Day 3

In [None]:
import pandas as pd
import numpy as np

In [None]:
PATH_DATA = '../../'
d_data = pd.read_csv(
    PATH_DATA + 'telcom_user_extended.csv',
    usecols=['customerID', 'gender', 'tenure', 'PaymentMethod',
             'MonthlyCharges', 'TotalCharges', 'Churn', 'age',
             'minutes_of_call', 'game_usage_megabytes',
             'average_internet_speed_in_megabytes'])

d_data = d_data.drop_duplicates()
d_data.columns

In [None]:
# row count vs. number of distinct customer IDs; equal values mean one row per customer
len(d_data['customerID']), len(d_data['customerID'].unique())

### 1. Sorting

In [None]:
d_data.head()

In [None]:
d_data['tenure'].sort_values().head()

In [None]:
d_data.sort_values(by='tenure').head()

In [None]:
d_data.sort_values(by='tenure', ascending=False).head()

In [None]:
d_data.sort_values(by=['tenure', 'PaymentMethod']).head(20)

In [None]:
d_data.sort_values(by=['tenure', 'PaymentMethod'], ascending = [True,False]).head(20)

### 2. Grouping

In [None]:
help(d_data.groupby('gender'))

In [None]:
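# numeric_only=True skips text columns; pandas 2.0+ raises a TypeError without it
d_data.groupby('gender').mean(numeric_only=True)

In [None]:
d_data.groupby(by='gender').mean(numeric_only=True).T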

In [None]:
dgroup_mean = d_data.groupby(['gender','Churn'])['MonthlyCharges'].mean()

dgroup_mean

In [None]:
# type(dgroup_mean)??

In [None]:
# string names avoid the FutureWarning that pandas 2.x emits for np.mean / np.median in agg
d_data.groupby(['gender','Churn']).agg({'MonthlyCharges': ['mean', 'size'], 'game_usage_megabytes': ['median']})
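
The dictionary form of `agg` returns two-level column names. Named aggregation is one alternative that yields flat, readable columns. The sketch below runs on made-up data: the `toy` frame and the output names `avg_charge` and `n_customers` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical miniature of the telecom data, for illustration only
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "Churn": ["No", "Yes", "No", "No"],
    "MonthlyCharges": [10.0, 30.0, 20.0, 40.0],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = (
    toy.groupby(["gender", "Churn"])
       .agg(avg_charge=("MonthlyCharges", "mean"),
            n_customers=("MonthlyCharges", "size"))
       .reset_index()  # turn the group keys back into ordinary columns
)
print(summary)
```

After `reset_index()`, the group keys are regular columns, which tends to make later merges and sorts simpler.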

### 3. Apply and Map Function

Commonly used functions: `apply()`, `applymap()`, and `map()`.

<img src="images/map.png" alt="Drawing" style="width: 450px;"/>
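
As a rough guide, `map()` works element by element on a single Series, `apply()` passes a whole row or column to the function, and `applymap()` works element by element across a DataFrame. The sketch below shows all three on a made-up frame; `toy` and its column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from the telecom file
toy = pd.DataFrame({
    "plan": ["a", "b", "a"],
    "mb": [1.5, 2.0, 0.5],
    "speed": [10.0, 20.0, 30.0],
})

# Series.map: element-wise on one column, here with a dictionary lookup
plan_label = toy["plan"].map({"a": "basic", "b": "premium"})

# DataFrame.apply with axis=1: the function receives one row at a time
ratio = toy.apply(lambda row: row["mb"] / row["speed"], axis=1)

# Element-wise over every cell of a DataFrame.
# pandas 2.1+ renames applymap to DataFrame.map, so pick whichever exists.
numeric = toy[["mb", "speed"]]
elementwise = numeric.map if hasattr(pd.DataFrame, "map") else numeric.applymap
kilo = elementwise(lambda x: x * 1000)

print(plan_label.tolist())
print(kilo)
```

For simple arithmetic such as `x * 1000`, a plain vectorized expression (`numeric * 1000`) is usually faster than any of the three; the functions are most useful when the logic cannot be expressed with column operations.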

In [None]:
d_data.head()

In [None]:
# some rows hold a single space instead of a number; treat those as 0 before casting to float
d_data.TotalCharges = d_data.TotalCharges.replace(' ', 0)
d_data.TotalCharges = d_data.TotalCharges.astype(float)

#### 3.1 `apply()`

In [None]:
d_data['number_of_month'] = d_data.apply(lambda row: np.floor(row.TotalCharges / row.MonthlyCharges), axis = 1)

In [None]:
d_data.head()

In [None]:
d_data['TotalCharges_type'] = d_data.apply(lambda row: 'high' if row.TotalCharges > 4000 else 'low' , axis = 1)

In [None]:
d_data.head()

#### 3.2 `applymap()`

In [None]:
# applymap calls the function on every cell; pandas 2.1+ deprecates it in favor of DataFrame.map
d_data[['game_usage_kilobytes', 'average_internet_speed_in_kilobytes']] = d_data[['game_usage_megabytes', 'average_internet_speed_in_megabytes']].applymap(lambda x: x*1000)

In [None]:
d_data[['customerID','gender','game_usage_kilobytes', 'average_internet_speed_in_kilobytes']].head()

### 4. Append, Concat, Join, Merge

    The join method works best when we are joining dataframes on their indexes (though you can specify another column to join on for the left dataframe).
    The merge method is more versatile and allows us to specify columns besides the index to join on for both dataframes. 
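
To make the difference concrete, here is a small sketch that produces the same result both ways. The frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup scenario: attach a price to each customer's plan
left = pd.DataFrame({"customerID": ["A", "B", "C"], "plan": ["x", "y", "x"]})
right = pd.DataFrame({"plan": ["x", "y"], "price": [10, 20]})

# merge: the key column is named directly on both sides
merged = left.merge(right, on="plan", how="left")

# join: the right frame must carry the key in its index;
# `on` selects the matching column of the left frame only
joined = left.join(right.set_index("plan"), on="plan", how="left")

print(merged)
print(joined)
```

Both calls return the columns `customerID`, `plan`, and `price`; `join` simply needs the extra `set_index` step when the key is not already the index of the right frame.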

In [None]:
d_data.head()

In [None]:
d_data1 = d_data[['customerID','gender','tenure']].sample(4)
d_data1.index = [1,2,3,4]
d_data2 = d_data[['customerID','gender','tenure']].sample(4)
d_data2.index = [1,2,3,4]
d_data3 = d_data[['customerID','gender','tenure']].sample(4)
d_data3.index = [1,2,6,7]
d_data4 = d_data[['customerID','gender','tenure','age']].sample(4)
d_data4.index = [1,2,6,7]

#### 4.1 `append`

In [None]:
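# help(pd.DataFrame.append)  # only available in pandas < 2.0

In [None]:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement
pd.concat([d_data1, d_data2])

In [None]:
pd.concat([d_data1, d_data2], ignore_index=True)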

#### 4.2 `concat`

In [None]:
pd.concat([d_data1,d_data2])

In [None]:
pd.concat([d_data1,d_data2],axis =1)

In [None]:
pd.concat([d_data1,d_data3],axis =1)

In [None]:
pd.concat([d_data1,d_data4])

#### 4.3 `join`

In [None]:
d_data_j1 = d_data[['customerID','gender']].sample(10)
d_data_j1

In [None]:
d_data_j2 = d_data[['tenure','TotalCharges']]
d_data_j2.head(30)

In [None]:
d_data_j1.join(d_data_j2, how='left')  # joins on the index

In [None]:
djoin2 = d_data_j1.join(d_data_j2, how='right')  # joins on the index
djoin2

In [None]:
djoin2.dropna()

#### 4.4 `merge`

In [None]:
d_data_m1 = d_data[d_data.PaymentMethod.isin(['Bank transfer (automatic)','Credit card (automatic)'])][['customerID','gender','PaymentMethod']].sample(5)
d_data_m1

In [None]:
dg1 = d_data[d_data.PaymentMethod.isin(['Bank transfer (automatic)','Electronic check','Mailed check'])].groupby('PaymentMethod')['minutes_of_call'].median().reset_index()
dg1.columns = ['PaymentMethod','median_minutes_of_call']
dg1

In [None]:
d_data_m1.merge(dg1,on = 'PaymentMethod',how='left')

In [None]:
d_data_m1.merge(dg1,on = 'PaymentMethod',how='right')

In [None]:
d_data.sample(10)

### 5. Pivot and Stack

#### 5.1 `pivot`

In [None]:
d_data.pivot(index='customerID', columns='PaymentMethod', values='tenure')

In [None]:
# d_data.pivot(index='gender', columns='PaymentMethod', values='tenure')

In [None]:
dpiv1 = d_data.pivot_table(index='gender', columns='PaymentMethod', values='tenure',aggfunc='mean')
dpiv1
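
The `pivot` call on `gender` is commented out because `pivot` cannot reshape when an index/column pair occurs more than once, whereas `pivot_table` aggregates the duplicates. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: the pair ("F", "card") appears twice
toy = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "PaymentMethod": ["card", "card", "check"],
    "tenure": [2, 4, 6],
})

# pivot refuses duplicated index/column pairs and raises ValueError
try:
    toy.pivot(index="gender", columns="PaymentMethod", values="tenure")
    pivot_error = None
except ValueError as err:
    pivot_error = str(err)

# pivot_table collapses the duplicates with the chosen aggregation
table = toy.pivot_table(index="gender", columns="PaymentMethod",
                        values="tenure", aggfunc="mean")

print(pivot_error)
print(table)
```

Combinations that never occur in the data, such as ("M", "card") here, show up as `NaN` in the pivot table.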

#### 5.2 `stack`

In [None]:
dpiv1.stack()
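
`stack()` moves the column labels into the innermost index level, and `unstack()` reverses that. The round trip can be sketched on made-up data as follows; `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data with one row per (gender, PaymentMethod) pair
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "PaymentMethod": ["card", "check", "card", "check"],
    "tenure": [1, 2, 3, 4],
})

wide = toy.pivot_table(index="gender", columns="PaymentMethod",
                       values="tenure", aggfunc="mean")

# stack: wide table -> Series indexed by (gender, PaymentMethod)
long = wide.stack()

# unstack: back to one column per PaymentMethod
wide_again = long.unstack()

print(long)
print(wide_again)
```

The stacked form is often convenient for plotting or further grouping, while the wide form is easier to read as a table.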

### Exercise

    1. Compute the average 'TotalCharges' grouped by ['gender','Churn','PaymentMethod'], then sort the result from highest to lowest.
    2. Apply x*60 to minutes_of_call to create second_of_call, then calculate the average minutes_of_call and the average second_of_call grouped by ['PaymentMethod'].
    3. 
    4.