# Formatting dirty data
Data Analysis Python Pandas Data Manipulation

Suppose you have the following 
[dataset](https://docs.google.com/spreadsheets/d/1DrvkAWnO1psWkFN1YVt891sHe4yHl4ljNPUVlCsI95M/edit#gid=0),
which contains (1st tab) a list of the items purchased by each user and
(2nd tab) a mapping from item_id to the item name and price.

Can you format the data into a matrix with users as rows 
and items as columns, where each cell holds the number of times that user purchased that item?

For example, if we have a user with the following row:
```
user_id 	ids
12345 	1, 4, 4, 3, 5, 5, 5
```

We would want the output to look like the following:
```
user_id     1 	2 	3 	4 	5
12345       1 	0 	1 	2 	3
```
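
As a quick sanity check on the expected output, here is a minimal sketch that rebuilds the example row with pandas. The `toy` frame and its `ids` column are stand-ins taken from the example above, not the real dataset, and `pd.crosstab` is just one convenient way to do the counting.

```python
import pandas as pd

# Toy input mirroring the example row above (not the real dataset).
toy = pd.DataFrame({"user_id": [12345], "ids": ["1, 4, 4, 3, 5, 5, 5"]})

# Split the string into a list, then give each item its own row.
long = (
    toy.assign(item_id=toy["ids"].str.split(","))
    .explode("item_id")
    .reset_index(drop=True)
)
long["item_id"] = long["item_id"].str.strip().astype(int)

# Count purchases per (user, item); reindex so unpurchased items show up as 0.
matrix = (
    pd.crosstab(long["user_id"], long["item_id"])
    .reindex(columns=range(1, 6), fill_value=0)
)
print(matrix)
```

Item 2 never appears in the input, so it only gets a column because of the explicit `reindex`.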

In [1]:
import pandas as pd

filename = 'q143_data.csv'
df = pd.read_csv(filename)
df.head()

Unnamed: 0,user_id,id
0,222087,"27,26"
1,1343649,"6,47,17"
2,404134,1812232227433820351
3,1110200,923220264737
4,224107,"31,18,5,13,1,21,48,16,26,2,44,32,20,37,42,35,4..."


In [2]:
# 2 steps: 1) split each comma-separated id string into one row per item, then 2) pivot df

# step 1: https://stackoverflow.com/a/28182629
unpivoted_df = (
    pd.DataFrame(
        df['id'].str.split(',').tolist(), 
        index=df['user_id']
    )
    .stack()
    .reset_index()
    .rename(columns={0:'item_id'})
    [['user_id','item_id']]
)
unpivoted_df.head()

Unnamed: 0,user_id,item_id
0,222087,27
1,222087,26
2,1343649,6
3,1343649,47
4,1343649,17
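
On pandas 0.25 or later, `explode` gives a shorter route to the same long format. A minimal sketch, assuming the same `user_id` / `id` column layout; `sample` is a two-row stand-in built from the values shown above, not the full CSV.

```python
import pandas as pd

# Small stand-in for df; the real data comes from q143_data.csv.
sample = pd.DataFrame({"user_id": [222087, 1343649], "id": ["27,26", "6,47,17"]})

unpivoted = (
    sample
    .assign(item_id=sample["id"].str.split(","))  # string -> list of item ids
    .explode("item_id")                           # one row per list element
    .reset_index(drop=True)
    [["user_id", "item_id"]]
)
print(unpivoted)
```

Either way, `item_id` is still a string at this point, which is why the pivot step later converts it with `int(...)`.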


In [3]:
counts_df = (
    unpivoted_df
    .groupby(['user_id','item_id'])
    .size()
    .reset_index(name='counts')
    .sort_values(by='counts', ascending=False) # just to show that some users bought the same item multiple times
    .reset_index(drop=True)
)

print('shape:', counts_df.shape)
print('number of item_id:', counts_df['item_id'].nunique())
print('number of user_id:', counts_df['user_id'].nunique())
counts_df.head()

shape: (290558, 3)
number of item_id: 48
number of user_id: 24885


Unnamed: 0,user_id,item_id,counts
0,599172,39,5
1,1198106,45,5
2,917199,18,5
3,920002,23,5
4,269335,2,5


In [4]:
# step 2: pivot
pivoted_df = counts_df.set_index(['user_id','item_id']).unstack(level=-1, fill_value=0)
# column wrangling ...
pivoted_df.columns = pivoted_df.columns.droplevel(0) # remove column multi-index
pivoted_df.columns = [f'item{int(i):02d}' for i in pivoted_df.columns] # zero-pad so the names sort numerically
pivoted_df = pivoted_df[sorted(pivoted_df.columns)]
pivoted_df.head()


user_id,item01,item02,item03,item04,item05,item06,item07,item08,item09,item10,...,item39,item40,item41,item42,item43,item44,item45,item46,item47,item48
47,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,1,0,0
68,0,0,0,0,0,1,0,0,0,1,...,1,0,0,1,0,0,0,0,0,0
113,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
123,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
223,1,1,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
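
For comparison, `pd.crosstab` can do the count and the pivot in one call. This is a sketch of an alternative, not the approach used above; `toy_long` is made-up data with the same shape as `unpivoted_df`.

```python
import pandas as pd

# Toy long-format data standing in for unpivoted_df (values are made up).
toy_long = pd.DataFrame({
    "user_id": [47, 47, 47, 68, 68],
    "item_id": ["2", "3", "3", "10", "6"],
})

# crosstab counts each (user_id, item_id) pair and fills missing pairs with 0.
# Casting item_id to int makes the columns sort numerically rather than as text.
toy_matrix = pd.crosstab(toy_long["user_id"], toy_long["item_id"].astype(int))

# Same zero-padded naming scheme as above.
toy_matrix.columns = [f"item{i:02d}" for i in toy_matrix.columns]
print(toy_matrix)
```

The trade-off is that the intermediate `counts_df` is no longer available for inspection.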


In [5]:
# check
assert pivoted_df.loc[599172, 'item39'] == 5
print('passed')

passed
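
The check above looks at a single cell. A broader check is that each row of the matrix sums to the number of ids listed for that user. Below is a minimal sketch on toy data; `raw` and `matrix` are hypothetical stand-ins for `df` and `pivoted_df`, and it assumes every id string is non-empty and comma-separated.

```python
import pandas as pd

# Toy raw data and its hand-built matrix (hypothetical values, not the real dataset).
raw = pd.DataFrame({"user_id": [1, 2], "id": ["3,3,7", "7"]})
matrix = pd.DataFrame(
    {"item03": [2, 0], "item07": [1, 1]},
    index=pd.Index([1, 2], name="user_id"),
)

# Ids listed per row = number of commas + 1; sum per user in case a user repeats.
expected_totals = (raw["id"].str.count(",") + 1).groupby(raw["user_id"]).sum()

# Each matrix row should add up to that user's total.
row_totals = matrix.sum(axis=1).reindex(expected_totals.index)
mismatches = expected_totals[row_totals != expected_totals]
assert mismatches.empty
print('passed')
```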
