# Objective : Shaping & Structuring
<hr>

1. Pivoting
2. Pivot Tables
3. Stacking & Unstacking
4. Melting
5. GroupBy
6. Cross Tab
7. Tiling
8. Computing Dummy Variables
9. Factorize
10. Exploding Data

<hr>

In [1]:
import pandas as pd
import numpy as np
gap_data = pd.read_csv('../Data/gapminder-FiveYearData.csv')

In [18]:
gap_data.sample(10)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
430,Djibouti,2002,447416.0,Africa,53.373,1908.260867
830,Korea Dem. Rep.,1962,10917494.0,Asia,56.656,1621.693598
1414,South Africa,2002,44433622.0,Africa,53.365,7710.946444
409,Denmark,1957,4487831.0,Europe,71.81,11099.65935
960,Mauritania,1952,1022556.0,Africa,40.543,743.11591
31,Algeria,1987,23254956.0,Africa,65.799,5681.358539
122,Benin,1962,2151895.0,Africa,42.618,949.499064
753,Ireland,1997,3667233.0,Europe,76.122,24521.94713
29,Algeria,1977,17152804.0,Africa,58.014,4910.416756
654,Honduras,1982,3669448.0,Americas,60.909,3121.760794


* Reshaping means transforming the structure of a table or vector (i.e. a DataFrame or Series) to make it suitable for further analysis. We will study 10 techniques for this.

### 1. Pivoting
* Creates a new derived table from a given table.
* pivot() takes three parameters (index, columns, values), each naming columns of the source table. The values parameter can list multiple columns.
* The table below shows the life expectancy trend for each country by year.
* Constraint: there cannot be more than one value for any (country, year) pair.

In [20]:
gap_data.pivot(index='country',columns='year', values=['lifeExp'])

Unnamed: 0_level_0,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Afghanistan,28.801,30.33200,31.99700,34.02000,36.08800,38.43800,39.854,40.822,41.674,41.763,42.129,43.828
Albania,55.230,59.28000,64.82000,66.22000,67.69000,68.93000,70.420,72.000,71.581,72.950,75.651,76.423
Algeria,43.077,45.68500,48.30300,51.40700,54.51800,58.01400,61.368,65.799,67.744,69.152,70.994,72.301
Angola,30.015,31.99900,34.00000,35.98500,37.92800,39.48300,39.942,39.906,40.647,40.963,41.003,42.731
Argentina,62.485,64.39900,65.14200,65.63400,67.06500,68.48100,69.942,70.774,71.868,73.275,74.340,75.320
Australia,69.120,70.33000,70.93000,71.10000,71.93000,73.49000,74.740,76.320,77.560,78.830,80.370,81.235
Austria,66.800,67.48000,69.54000,70.14000,70.63000,72.17000,73.180,74.940,76.040,77.510,78.980,79.829
Bahrain,50.939,53.83200,56.92300,59.92300,63.30000,65.59300,69.052,70.750,72.601,73.925,74.795,75.635
Bangladesh,37.484,39.34800,41.21600,43.45300,45.25200,46.92300,50.009,52.819,56.018,59.412,62.013,64.062
Belgium,68.000,69.24000,70.25000,70.94000,71.44000,72.80000,73.930,75.350,76.460,77.530,78.320,79.441


In [22]:
# Raises ValueError: several rows share the same (continent, year) pair
gap_data.pivot(index='continent',columns='year', values=['lifeExp'])

ValueError: Index contains duplicate entries, cannot reshape
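
One way to see the constraint in action is to count the rows behind each (index, columns) key before pivoting. The sketch below uses a small made-up table (the values are hypothetical, not taken from the gapminder file) in which two rows share the key ('Asia', 2007).

```python
import pandas as pd

# Toy data with hypothetical values: two rows share the key ('Asia', 2007).
toy = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe'],
    'year': [2007, 2007, 2007],
    'lifeExp': [70.0, 80.0, 78.0],
})

# Count rows per (index, columns) key; any count above 1 makes pivot() fail.
key_counts = toy.groupby(['continent', 'year']).size()
duplicate_keys = key_counts[key_counts > 1]
print(duplicate_keys)

# Confirm that pivot() refuses to reshape when duplicate keys exist.
try:
    toy.pivot(index='continent', columns='year', values='lifeExp')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```

If duplicate_keys is empty, pivot() should succeed; otherwise the duplicates need to be aggregated first.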

### 2. Pivot Table
* A pivot table solves the previous problem.
* It can aggregate multiple values that fall into the same cell.
* In the previous data, each continent appears repeatedly because it contains multiple countries.
* The aggregate function used here is sum.

In [162]:
gap_data.pivot_table(index='continent',columns='year',values='pop', aggfunc='sum')

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Africa,237640500.0,264837700.0,296516900.0,335289500.0,379879500.0,433061000.0,499348600.0,574834100.0,659081500.0,743833000.0,833723900.0,929539700.0
Americas,345152400.0,386953900.0,433270300.0,480746600.0,529384200.0,578067700.0,630290900.0,682754000.0,739274100.0,796900400.0,849772800.0,898871200.0
Asia,1395357000.0,1562781000.0,1696357000.0,1905663000.0,2150972000.0,2384514000.0,2610136000.0,2871221000.0,3133292000.0,3383286000.0,3601802000.0,3811954000.0
Europe,418120800.0,437890400.0,460355200.0,481179000.0,500635100.0,517164500.0,531266900.0,543094200.0,558142800.0,568944100.0,578223900.0,586098500.0
Oceania,10686010.0,11941980.0,13283520.0,14600410.0,16106100.0,17239000.0,18394850.0,19574420.0,20919650.0,22241430.0,23454830.0,24549950.0


* Setting margins=True adds a row and a column of totals.
* The label for the totals can be set with margins_name.

In [27]:
gap_data.pivot_table(index='continent',columns='year',values='pop', aggfunc='sum',margins=True, margins_name='Total')

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007,Total
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Africa,237640500.0,264837700.0,296516900.0,335289500.0,379879500.0,433061000.0,499348600.0,574834100.0,659081500.0,743833000.0,833723900.0,929539700.0,6187586000.0
Americas,345152400.0,386953900.0,433270300.0,480746600.0,529384200.0,578067700.0,630290900.0,682754000.0,739274100.0,796900400.0,849772800.0,898871200.0,7351438000.0
Asia,1395357000.0,1562781000.0,1696357000.0,1905663000.0,2150972000.0,2384514000.0,2610136000.0,2871221000.0,3133292000.0,3383286000.0,3601802000.0,3811954000.0,30507330000.0
Europe,418120800.0,437890400.0,460355200.0,481179000.0,500635100.0,517164500.0,531266900.0,543094200.0,558142800.0,568944100.0,578223900.0,586098500.0,6181115000.0
Oceania,10686010.0,11941980.0,13283520.0,14600410.0,16106100.0,17239000.0,18394850.0,19574420.0,20919650.0,22241430.0,23454830.0,24549950.0,212992100.0
Total,2406957000.0,2664405000.0,2899783000.0,3217478000.0,3576977000.0,3930046000.0,4289437000.0,4691477000.0,5110710000.0,5515204000.0,5886978000.0,6251013000.0,50440470000.0
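
To make the aggregation step concrete, the sketch below builds the same kind of table by hand on a small made-up DataFrame (hypothetical values). It assumes every (continent, year) cell has at least one row, so no missing values appear. A pivot table with aggfunc='sum' should match a groupby on both keys followed by unstack, and the margins are the row and column sums.

```python
import pandas as pd

# Toy data with hypothetical values; ('Asia', 2002) appears twice.
toy = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Asia', 'Europe', 'Europe'],
    'year': [2002, 2002, 2007, 2002, 2007],
    'pop': [10, 1, 20, 5, 7],
})

# Pivot table: duplicates in a cell are summed.
table = toy.pivot_table(index='continent', columns='year',
                        values='pop', aggfunc='sum')

# Same result built by hand: group on both keys, sum, move 'year' to columns.
by_group = toy.groupby(['continent', 'year'])['pop'].sum().unstack()

# Margins add a 'Total' row and column holding the sums.
with_totals = toy.pivot_table(index='continent', columns='year', values='pop',
                              aggfunc='sum', margins=True, margins_name='Total')
print(table)
print(with_totals)
```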


### 3. Stacking & Unstacking
* Let us assume we have a DataFrame with a MultiIndex on both the rows and the columns.
* Stacking a DataFrame means moving (also described as rotating or pivoting) the innermost column index to become the innermost row index.
* The inverse operation is called unstacking. It moves the innermost row index to become the innermost column index.
* Stacking makes the DataFrame taller (more rows, fewer columns).
* Unstacking makes the DataFrame wider (more columns, fewer rows).

In [77]:
index = pd.MultiIndex.from_product([[2013, 2014], ['yes','no']],
                                   names=['year', 'death'])
columns = pd.MultiIndex.from_product([['Mumbai', 'Delhi', 'Bangalore'], 
                                      ['two-wheeler', 'four-wheeler']],
                                     names=['city', 'type'])


data = np.random.randint(1,100,(4,6))

accident_data = pd.DataFrame(data, index=index, columns=columns)
accident_data

Unnamed: 0_level_0,city,Mumbai,Mumbai,Delhi,Delhi,Bangalore,Bangalore
Unnamed: 0_level_1,type,two-wheeler,four-wheeler,two-wheeler,four-wheeler,two-wheeler,four-wheeler
year,death,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,yes,24,38,89,72,20,19
2013,no,29,72,43,15,95,17
2014,yes,8,40,82,8,22,73
2014,no,1,19,98,32,12,62


In [78]:
accident_data.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,city,Bangalore,Delhi,Mumbai
year,death,type,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2013,yes,four-wheeler,19,72,38
2013,yes,two-wheeler,20,89,24
2013,no,four-wheeler,17,15,72
2013,no,two-wheeler,95,43,29
2014,yes,four-wheeler,73,8,40
2014,yes,two-wheeler,22,82,8
2014,no,four-wheeler,62,32,19
2014,no,two-wheeler,12,98,1


In [79]:
accident_data.unstack()

city,Mumbai,Mumbai,Mumbai,Mumbai,Delhi,Delhi,Delhi,Delhi,Bangalore,Bangalore,Bangalore,Bangalore
type,two-wheeler,two-wheeler,four-wheeler,four-wheeler,two-wheeler,two-wheeler,four-wheeler,four-wheeler,two-wheeler,two-wheeler,four-wheeler,four-wheeler
death,no,yes,no,yes,no,yes,no,yes,no,yes,no,yes
year,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3
2013,29,24,72,38,43,89,15,72,95,20,17,19
2014,1,8,19,40,98,82,32,8,12,22,62,73


In [50]:
countries = ['India','India','US','US','Australia','Australia','Japan','Japan']
gender = ['male','female','male','female','male','female','male','female']

In [53]:
list(zip(countries,gender))

[('India', 'male'),
 ('India', 'female'),
 ('US', 'male'),
 ('US', 'female'),
 ('Australia', 'male'),
 ('Australia', 'female'),
 ('Japan', 'male'),
 ('Japan', 'female')]

In [56]:
index = pd.MultiIndex.from_tuples(list(zip(countries,gender)), names=['country', 'gender'])

In [57]:
index

MultiIndex(levels=[['Australia', 'India', 'Japan', 'US'], ['female', 'male']],
           codes=[[1, 1, 3, 3, 0, 0, 2, 2], [1, 0, 1, 0, 1, 0, 1, 0]],
           names=['country', 'gender'])

In [60]:
fake_phd_data = pd.DataFrame([10,20,13,15,16,20,33,12], index=index, columns=['PhDs'])

In [61]:
fake_phd_data

Unnamed: 0_level_0,Unnamed: 1_level_0,PhDs
country,gender,Unnamed: 2_level_1
India,male,10
India,female,20
US,male,13
US,female,15
Australia,male,16
Australia,female,20
Japan,male,33
Japan,female,12


In [63]:
fake_phd_data.unstack()

Unnamed: 0_level_0,PhDs,PhDs
gender,female,male
country,Unnamed: 1_level_2,Unnamed: 2_level_2
Australia,20,16
India,20,10
Japan,12,33
US,15,13


In [65]:
fake_phd_data.T.stack()

Unnamed: 0_level_0,country,Australia,India,Japan,US
Unnamed: 0_level_1,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
PhDs,female,20,20,12,15
PhDs,male,16,10,33,13
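
Both methods move the innermost level by default, but a different level can be chosen with the level argument. The following is a minimal sketch on a reduced, made-up version of the PhD table above. It also shows that stack() undoes unstack() when no values are missing.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['India', 'US'], ['female', 'male']], names=['country', 'gender'])
phd = pd.DataFrame({'PhDs': [20, 10, 15, 13]}, index=index)

# unstack() moves the innermost row level ('gender') to the columns by default.
wide_by_gender = phd['PhDs'].unstack()

# Passing level= picks a different row level to move.
wide_by_country = phd['PhDs'].unstack(level='country')

# stack() reverses unstack(), returning a Series indexed by (country, gender).
round_trip = wide_by_gender.stack()

print(wide_by_gender)
print(wide_by_country)
print(round_trip)
```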


### 4. Melting
* Unpivots a DataFrame from wide format to long format: the columns not listed in id_vars are turned into rows, one (variable, value) pair per row.

In [155]:
# Create a pivot table first
res = gap_data.pivot_table(index='continent',columns='year',values='gdpPercap', aggfunc='sum').round(1)

In [156]:
res

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Africa,65133.8,72032.3,83100.1,106618.9,121660.0,134468.8,129042.8,118698.8,118654.1,123695.5,135168.0,160629.7
Americas,101976.6,115401.1,122538.5,141706.3,162283.4,183800.2,187668.4,194835.0,201123.4,222232.5,232191.9,275075.8
Asia,171451.0,190995.2,189069.2,197048.7,270186.5,257113.4,245326.5,251071.5,285109.8,324525.1,335745.0,411609.9
Europe,169831.7,208890.4,250964.6,304314.7,374387.3,428519.4,468536.9,516429.3,511847.0,572303.5,651352.0,751634.4
Oceania,20596.2,23197.0,25392.9,28990.0,32834.7,34567.9,37109.4,40896.1,41788.1,48048.4,53877.6,59620.4


In [159]:
res.reset_index(inplace=True)

In [160]:
res

year,continent,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
0,Africa,65133.8,72032.3,83100.1,106618.9,121660.0,134468.8,129042.8,118698.8,118654.1,123695.5,135168.0,160629.7
1,Americas,101976.6,115401.1,122538.5,141706.3,162283.4,183800.2,187668.4,194835.0,201123.4,222232.5,232191.9,275075.8
2,Asia,171451.0,190995.2,189069.2,197048.7,270186.5,257113.4,245326.5,251071.5,285109.8,324525.1,335745.0,411609.9
3,Europe,169831.7,208890.4,250964.6,304314.7,374387.3,428519.4,468536.9,516429.3,511847.0,572303.5,651352.0,751634.4
4,Oceania,20596.2,23197.0,25392.9,28990.0,32834.7,34567.9,37109.4,40896.1,41788.1,48048.4,53877.6,59620.4


In [161]:
melted_df = pd.melt(res, id_vars=['continent'])
melted_df

Unnamed: 0,continent,year,value
0,Africa,1952,65133.8
1,Americas,1952,101976.6
2,Asia,1952,171451.0
3,Europe,1952,169831.7
4,Oceania,1952,20596.2
5,Africa,1957,72032.3
6,Americas,1957,115401.1
7,Asia,1957,190995.2
8,Europe,1957,208890.4
9,Oceania,1957,23197.0


In [154]:
melted_df.sort_values(['continent','year']).round(1)

Unnamed: 0,continent,year,value
0,Africa,1952,65133.8
5,Africa,1957,72032.3
10,Africa,1962,83100.1
15,Africa,1967,106618.9
20,Africa,1972,121660.0
25,Africa,1977,134468.8
30,Africa,1982,129042.8
35,Africa,1987,118698.8
40,Africa,1992,118654.1
45,Africa,1997,123695.5
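
The generic column names variable and value can be replaced through the var_name and value_name arguments of melt. The sketch below uses a small made-up wide table (hypothetical values, and gdp_total is an illustrative name) and also shows that pivot() restores the wide layout, since every (continent, year) pair is unique after melting.

```python
import pandas as pd

# Small wide table with hypothetical values: one column per year.
wide = pd.DataFrame({
    'continent': ['Africa', 'Asia'],
    1952: [1.0, 2.0],
    1957: [3.0, 4.0],
})

# Wide -> long, naming the two generated columns explicitly.
long = pd.melt(wide, id_vars=['continent'],
               var_name='year', value_name='gdp_total')

# Long -> wide again; works because each (continent, year) pair is unique.
restored = long.pivot(index='continent', columns='year', values='gdp_total')

print(long)
print(restored)
```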


### 5. GroupBy

In [129]:
titanic_data = pd.read_csv('../Data/titanic-train.csv.txt', index_col='PassengerId')

In [134]:
titanic_data.groupby(['Pclass','Sex','Survived']).size()

Pclass  Sex     Survived
1       female  0             3
                1            91
        male    0            77
                1            45
2       female  0             6
                1            70
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
dtype: int64

In [135]:
gap_data.groupby(['continent']).lifeExp.mean()

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

In [153]:
gap_data.groupby(['continent','country']).lifeExp.mean().round(1).sort_values()

continent  country                 
Africa     Sierra Leone                36.8
Asia       Afghanistan                 37.5
Africa     Angola                      37.9
           Guinea-Bissau               39.2
           Mozambique                  40.4
           Somalia                     41.0
           Rwanda                      41.5
           Liberia                     42.5
           Equatorial Guinea           43.0
           Guinea                      43.2
           Malawi                      43.4
           Mali                        43.4
           Nigeria                     43.6
           Central African Republic    43.9
           Gambia                      44.4
           Ethiopia                    44.5
           Congo Dem. Rep.             44.5
           Niger                       44.6
           Burkina Faso                44.7
           Burundi                     44.8
           Eritrea                     46.0
           Zambia                      4
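
The calls above each compute one statistic per group. As a hedged sketch on made-up data (hypothetical values), several statistics can be computed in one pass with named aggregation, where each keyword names an output column and maps to a (source column, function) pair. The output names mean_lifeExp and countries are illustrative.

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe'],
    'country': ['India', 'Japan', 'France'],
    'lifeExp': [60.0, 80.0, 78.0],
})

# Named aggregation: output_column=(source_column, aggregation).
summary = toy.groupby('continent').agg(
    mean_lifeExp=('lifeExp', 'mean'),
    countries=('country', 'nunique'),
)
print(summary)
```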

### 6. Cross Tabulations
* crosstab() computes a cross-tabulation of two (or more) factors. By default it computes a frequency table of the factors unless an array of values and an aggregation function are passed.
* crosstab is one of the easiest ways to get quick results compared to pivot_table and other options.
* Why use crosstab at all? The short answer is that it provides a couple of handy options to format and summarize the data more easily.
* The longer answer is that it can be hard to remember all the steps needed to build the same table by hand. The simple crosstab API is the quickest route to the solution and provides useful shortcuts for certain types of analysis.

In [143]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?" )

In [146]:
df_raw

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.00,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.00,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.00,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.00,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.00,115.0,5500.0,18,22,17450.0
5,2,,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.40,8.50,110.0,5500.0,19,25,15250.0
6,1,158.0,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.40,8.50,110.0,5500.0,19,25,17710.0
7,1,,audi,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.40,8.50,110.0,5500.0,19,25,18920.0
8,1,158.0,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.40,8.30,140.0,5500.0,17,20,23875.0
9,0,,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.40,7.00,160.0,5500.0,16,22,


In [145]:
pd.crosstab(df_raw.make, df_raw.body_style)

body_style,convertible,hardtop,hatchback,sedan,wagon
make,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
alfa-romero,2,0,1,0,0
audi,0,0,1,5,1
bmw,0,0,0,8,0
chevrolet,0,0,2,1,0
dodge,0,0,5,3,1
honda,0,0,7,5,1
isuzu,0,0,1,3,0
jaguar,0,0,0,3,0
mazda,0,0,10,7,0
mercedes-benz,1,2,0,4,1


In [147]:
pd.crosstab(df_raw.make, df_raw.num_doors, margins=True, margins_name="Total")

num_doors,four,two,Total
make,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
alfa-romero,0,3,3
audi,5,2,7
bmw,5,3,8
chevrolet,1,2,3
dodge,4,4,8
honda,5,8,13
isuzu,2,2,4
jaguar,2,1,3
mazda,7,9,16
mercedes-benz,5,3,8


In [151]:
pd.crosstab(df_raw.make, df_raw.body_style, values=df_raw.price, aggfunc='mean').round(1)

body_style,convertible,hardtop,hatchback,sedan,wagon
make,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
alfa-romero,14997.5,,16500.0,,
audi,,,,17647.0,18920.0
bmw,,,,26118.8,
chevrolet,,,5723.0,6575.0,
dodge,,,7819.8,7619.7,8921.0
honda,,,7054.4,9945.0,7295.0
isuzu,,,11048.0,6785.0,
jaguar,,,,34600.0,
mazda,,,10085.0,11464.1,
mercedes-benz,35056.0,36788.0,,33074.0,28248.0
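
To illustrate the steps that crosstab saves, the sketch below reproduces a frequency table by hand with groupby, size and unstack on a small made-up table (hypothetical rows, not the imports-85 data). It also shows the normalize option, one of the shortcuts mentioned above, which turns counts into row shares.

```python
import pandas as pd

# Toy data with hypothetical rows.
cars = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan', 'sedan'],
})

# Frequency table in one call.
counts = pd.crosstab(cars['make'], cars['body_style'])

# The same table by hand: count rows per pair, then spread body_style to columns.
same_counts = cars.groupby(['make', 'body_style']).size().unstack(fill_value=0)

# normalize='index' divides each row by its total, giving shares per make.
shares = pd.crosstab(cars['make'], cars['body_style'], normalize='index')

print(counts)
print(shares)
```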


### 7. Tiling
* Transforms continuous values into discrete values (bins).

In [165]:
titanic_data.fillna({'Age':29}, inplace=True)

In [166]:
titanic_data.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [168]:
titanic_data.Age.max()

80.0

In [170]:
age_bucket = pd.cut(x = titanic_data.Age, bins=[0,20,60,80],labels=['kid','adult','old'])
age_bucket

PassengerId
1      adult
2      adult
3      adult
4      adult
5      adult
6      adult
7      adult
8        kid
9      adult
10       kid
11       kid
12     adult
13       kid
14     adult
15       kid
16     adult
17       kid
18     adult
19     adult
20     adult
21     adult
22     adult
23       kid
24     adult
25       kid
26     adult
27     adult
28       kid
29     adult
30     adult
       ...  
862    adult
863    adult
864    adult
865    adult
866    adult
867    adult
868    adult
869    adult
870      kid
871    adult
872    adult
873    adult
874    adult
875    adult
876      kid
877      kid
878      kid
879    adult
880    adult
881    adult
882    adult
883    adult
884    adult
885    adult
886    adult
887    adult
888      kid
889    adult
890    adult
891    adult
Name: Age, Length: 891, dtype: category
Categories (3, object): [kid < adult < old]

In [171]:
titanic_data['Age_Bucket'] = age_bucket

In [172]:
titanic_data.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Bucket
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,adult
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,adult
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,adult
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,adult
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,adult
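
One detail worth checking when choosing bin edges: by default cut builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. The sketch below uses the same edges and labels as above on a few made-up boundary ages and shows the effect of include_lowest=True.

```python
import pandas as pd

# Boundary values chosen to sit exactly on the bin edges.
ages = pd.Series([0, 20, 21, 60, 80])
bins = [0, 20, 60, 80]
labels = ['kid', 'adult', 'old']

# Default intervals are (0, 20], (20, 60], (60, 80]: age 0 gets no bucket.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())   # [nan, 'kid', 'adult', 'adult', 'old']
print(with_lowest.tolist())   # ['kid', 'kid', 'adult', 'adult', 'old']
```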


### 8. Computing Dummy Variables
* Converts labels into one-hot vectors.
* Each category of a categorical variable becomes its own dummy/indicator column.

In [175]:
pd.get_dummies(titanic_data.Sex, prefix='Sex')

Unnamed: 0_level_0,Sex_female,Sex_male
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0,1
2,1,0
3,1,0
4,1,0
5,0,1
6,0,1
7,0,1
8,0,1
9,1,0
10,1,0
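
The 0/1 output above depends on the pandas version: pandas 2.0 and later return boolean columns by default. A minimal sketch on made-up data, passing dtype=int to get 0/1 columns explicitly and drop_first=True to drop the redundant first category:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female'], name='Sex')

# dtype=int requests 0/1 integers instead of the version-dependent default.
dummies = pd.get_dummies(sex, prefix='Sex', dtype=int)

# drop_first=True removes the first category column; with two categories,
# one column carries all the information.
reduced = pd.get_dummies(sex, prefix='Sex', dtype=int, drop_first=True)

print(dummies)
print(reduced)
```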


### 9. Factorize
* Encodes labels as integer codes, one code per distinct value.
* This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

In [176]:
df_raw.sample(10)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
166,1,168.0,toyota,gas,std,two,hatchback,rwd,front,94.5,...,98,mpfi,3.24,3.08,9.4,112.0,6600.0,26,29,9538.0
138,2,83.0,subaru,gas,std,two,hatchback,fwd,front,93.7,...,97,2bbl,3.62,2.36,9.0,69.0,4900.0,31,36,5118.0
65,0,118.0,mazda,gas,std,four,sedan,rwd,front,104.9,...,140,mpfi,3.76,3.16,8.0,120.0,5000.0,19,27,18280.0
141,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,108,2bbl,3.62,2.64,9.5,82.0,4800.0,32,37,7126.0
181,-1,,toyota,gas,std,four,wagon,rwd,front,104.5,...,161,mpfi,3.27,3.35,9.2,156.0,5200.0,19,24,15750.0
85,1,125.0,mitsubishi,gas,std,four,sedan,fwd,front,96.3,...,122,2bbl,3.35,3.46,8.5,88.0,5000.0,25,32,6989.0
40,0,85.0,honda,gas,std,four,sedan,fwd,front,96.5,...,110,1bbl,3.15,3.58,9.0,86.0,5800.0,27,33,10295.0
167,2,134.0,toyota,gas,std,two,hardtop,rwd,front,98.4,...,146,mpfi,3.62,3.5,9.3,116.0,4800.0,24,30,8449.0
20,0,81.0,chevrolet,gas,std,four,sedan,fwd,front,94.5,...,90,2bbl,3.03,3.11,9.6,70.0,5400.0,38,43,6575.0
143,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,108,mpfi,3.62,2.64,9.0,94.0,5200.0,26,32,9960.0


In [177]:
pd.factorize(df_raw.fuel_system)

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
        1, 0, 1, 1, 1, 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 1, 1,
        1, 1, 4, 0, 0, 0, 1, 1, 1, 1, 1, 5, 5, 5, 0, 1, 1, 1, 1, 6, 1, 0,
        6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 1, 1, 1, 7, 7, 1, 7, 7, 7, 1, 1, 7,
        7, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 6, 0,
        6, 0, 6, 0, 6, 0, 6, 0, 1, 7, 1, 1, 1, 1, 7, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
        1, 1, 1, 1, 6, 6, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0,
        0, 0, 0, 0, 0, 0, 6, 0, 6, 0, 0, 6, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 6, 0]),
 Index(['mpfi', '2bbl', 'mfi', '1bbl', 'spfi', '4bbl', 'idi', 'spdi'], dtype='object'))
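
factorize returns two objects: the integer codes and the distinct values in order of first appearance. The sketch below, on a short made-up Series, unpacks both, shows that a missing value receives the code -1 by default, and maps the codes back to labels.

```python
import numpy as np
import pandas as pd

fuel = pd.Series(['mpfi', '2bbl', 'mpfi', np.nan, 'idi'])

# codes: one integer per row; uniques: distinct labels in order of appearance.
codes, uniques = pd.factorize(fuel)
print(codes.tolist())    # [0, 1, 0, -1, 2]
print(list(uniques))     # ['mpfi', '2bbl', 'idi']

# Map codes back to labels, skipping the -1 used for the missing value.
decoded = uniques.take(codes[codes >= 0])
print(list(decoded))
```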

### 10. Exploding data
* Transforms each element of a list-like column value into its own row, repeating the index and the other column values.

In [6]:
df = pd.DataFrame({'Name':['Abhi','Mac','Ram'],'Marks':[[87,73,22],[22,11],[44,55,66]]})

In [8]:
df

Unnamed: 0,Name,Marks
0,Abhi,"[87, 73, 22]"
1,Mac,"[22, 11]"
2,Ram,"[44, 55, 66]"


In [9]:
df.explode('Marks')

Unnamed: 0,Name,Marks
0,Abhi,87
0,Abhi,73
0,Abhi,22
1,Mac,22
1,Mac,11
2,Ram,44
2,Ram,55
2,Ram,66
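
As the output shows, explode keeps the original index labels, so they repeat. The exploded column also keeps the object dtype. A minimal sketch of a common cleanup on made-up data, resetting the index and casting the column to integers:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Abhi', 'Mac'], 'Marks': [[87, 73], [22]]})

# One row per list element; index labels repeat and dtype stays object.
exploded = df.explode('Marks')

# Give each row its own index label and convert the marks to integers.
tidy = exploded.reset_index(drop=True).astype({'Marks': 'int64'})

print(exploded)
print(tidy)
```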
