We need to restructure the data as it comes out of the database. The original structure is flat, which gives maximum flexibility when widget types are added to, removed from, or updated in the widget enum. As a result, we start with one row per widget, with its type and weight specified. The most convenient structure for modeling is one row per stats entry, with each widget type as a column and that widget's weight as the value.
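
To make the target shape concrete, here is a minimal sketch on a toy frame. The column names `stats`, `type` and `weight` come from the data used in this notebook; the values are made up, and `pivot_table` is used only for illustration. It is not the approach taken in the cells that follow.

```python
import pandas as pd

# Toy stand-in for the flat export: one row per widget (values are made up).
flat = pd.DataFrame({
    "stats":  [1, 1, 2],
    "type":   [18, 19, 18],
    "weight": [201, 202, 201],
})

# One row per stats entry, one column per widget type, weight as the value.
# Widget types missing from a stats entry show up as NaN.
wide = flat.pivot_table(index="stats", columns="type", values="weight", aggfunc="max")
print(wide)
```

On the real data, a reshape along these lines would likely be much faster than a row-by-row loop, but that has not been verified here.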

Load the dataframe and print the list of widget types in the data.

In [1]:
import pandas as pd

df = pd.read_csv("~/Repositories/datasets/cmsdata/cms_data.visited.csv")
df.type.unique()

array([18, 19, 15, 14, 22, 17, 21, 54, 24, 25, 23, 20])

Now we add a column for each widget type, initialized to 0.

In [2]:
# Use string labels so the columns match the str(widget) lookups in the update loop.
for w in df.type.unique():
    df[str(w)] = 0

Now fill each widget column with that widget's weight, one stats entry at a time.

In [3]:
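stats_ids = df.stats.unique()
df_len = len(stats_ids)
widget_types = ['24', '22', '15', '14', '54', '21', '25', '17', '18', '19', '23', '20']
for counter, s in enumerate(stats_ids, start=1):
    if counter % 1000 == 0:
        print(f"Updating {counter} of {df_len}")
    in_stats = df.stats == s  # rows that belong to this stats entry
    for widget in widget_types:
        # Index-aligned assignment: the row of this widget type receives its
        # weight; the other rows of the stats entry receive NaN.
        df.loc[in_stats, widget] = df[in_stats & (df.type == int(widget))].weight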

Updating 1000 of 238349
Updating 2000 of 238349
Updating 3000 of 238349
Updating 4000 of 238349
Updating 5000 of 238349
Updating 6000 of 238349
Updating 7000 of 238349
Updating 8000 of 238349
Updating 9000 of 238349
Updating 10000 of 238349
Updating 11000 of 238349
Updating 12000 of 238349
Updating 13000 of 238349
Updating 14000 of 238349
Updating 15000 of 238349
Updating 16000 of 238349
Updating 17000 of 238349
Updating 18000 of 238349
Updating 19000 of 238349
Updating 20000 of 238349
Updating 21000 of 238349
Updating 22000 of 238349
Updating 23000 of 238349
Updating 24000 of 238349
Updating 25000 of 238349
Updating 26000 of 238349
Updating 27000 of 238349
Updating 28000 of 238349
Updating 29000 of 238349
Updating 30000 of 238349
Updating 31000 of 238349
Updating 32000 of 238349
Updating 33000 of 238349
Updating 34000 of 238349
Updating 35000 of 238349
Updating 36000 of 238349
Updating 37000 of 238349
Updating 38000 of 238349
Updating 39000 of 238349
Updating 40000 of 238349
Updating 

Double check our work...

In [4]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,stats,org,form,weight,type,id.1,visits,...,15,14,54,21,25,17,18,19,23,20
0,4,4,13186962,19773294,19428,58422,201,18,19773294,9,...,,,,,,,201.0,,,
1,11,11,7361918,19041869,144,118,202,19,19041869,1,...,,,,,,,,202.0,,
2,26,26,3328278,18535351,431145,832619,103,15,18535351,11,...,103.0,,,,,,,,,
3,49,49,2280741,18403727,438285,879703,105,14,18403727,79,...,,105.0,,,,,,,,
4,68,68,4242953,18650402,175,875632,107,15,18650402,2,...,107.0,,,,,,,,,
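
The NaN values in the widget columns come from how pandas aligns a Series on its index during a `.loc` assignment. The following is a minimal sketch on a made-up two-row frame that mirrors the update loop; the column is created as a float here (an assumption of this sketch) so that NaN can be stored without changing the dtype.

```python
import pandas as pd

# Toy frame: two widgets (types 18 and 19) on the same stats entry.
toy = pd.DataFrame({"stats": [7, 7], "type": [18, 19], "weight": [201, 202]})
toy["18"] = 0.0

in_stats = toy.stats == 7
# The right-hand side contains only the row with type 18 (index label 0).
rhs = toy[in_stats & (toy.type == 18)].weight
# Assignment aligns on index labels: row 0 gets its weight,
# row 1 has no matching label on the right-hand side and gets NaN.
toy.loc[in_stats, "18"] = rhs
print(toy)
```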


We now have one row per widget per stats entry. Each row holds its own widget's weight in the matching column, and every other widget column is NaN. We want to collapse these rows into one row per stats entry, dropping the NaN values.

In [5]:
# max() skips NaN, so each widget column keeps its non-null weight per stats entry.
df = df.groupby('stats').max()
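
As a small illustration of why `max()` collapses the rows, here is a sketch on a made-up frame shaped like the intermediate result. Only the column names follow the notebook; the values are invented.

```python
import numpy as np
import pandas as pd

# Toy version of the intermediate frame: one row per widget, NaN elsewhere.
toy = pd.DataFrame({
    "stats": [7, 7, 8],
    "18": [201.0, np.nan, 201.0],
    "19": [np.nan, 202.0, np.nan],
})

# max() ignores NaN within each group, so each widget column keeps its
# non-null weight; a group with no value for a widget stays NaN.
collapsed = toy.groupby("stats").max()
print(collapsed)
```

Note that `max()` is applied to every column, so per-widget columns such as `id`, `type` and `weight` end up holding the largest value within each stats entry rather than describing a single widget.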

Double check our work...

In [6]:
df.head()

stats,Unnamed: 0,Unnamed: 0.1,id,org,form,weight,type,id.1,visits,mobile_visits,...,15,14,54,21,25,17,18,19,23,20
18116752,18587247,700320,100,20450,59442,202,54,18116752,7,0,...,104.0,105.0,106.0,107.0,108.0,200.0,201.0,202.0,,
18116762,22948610,948610,200,1527,1510,202,54,18116762,1,0,...,104.0,105.0,106.0,,108.0,200.0,201.0,202.0,,
18116769,18921829,921829,270,340,312,202,54,18116769,1,0,...,104.0,105.0,106.0,107.0,108.0,200.0,201.0,202.0,,
18116780,20264967,948788,384,430140,834664,202,54,18116780,2,0,...,104.0,105.0,106.0,107.0,108.0,200.0,201.0,202.0,,
18116782,13718672,987410,404,430140,833686,202,54,18116782,7,0,...,104.0,105.0,106.0,107.0,108.0,200.0,201.0,202.0,,


And now that we've waited a painfully long time to make this happen, write the reformatted dataframe to CSV so that we never have to do it again.

In [7]:
df.to_csv("~/Repositories/datasets/cms_data.restructured.csv")
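
When the file is read back later, `stats` will come back as an ordinary column unless it is named as the index. Here is a minimal round-trip sketch, using an in-memory buffer in place of the real file and a made-up one-row frame:

```python
import io
import pandas as pd

# Toy frame indexed by stats, like the grouped result (values are made up).
toy = pd.DataFrame({"18": [201.0], "19": [202.0]},
                   index=pd.Index([7], name="stats"))

buf = io.StringIO()   # in-memory stand-in for the CSV file
toy.to_csv(buf)       # the index is written as a leading "stats" column
buf.seek(0)

# Naming the index column on read restores it as the index.
restored = pd.read_csv(buf, index_col="stats")
print(restored)
```

The `Unnamed: 0` columns visible in the `head()` outputs look like the residue of earlier round trips in which an unnamed index was written and then read back as data, although that is an inference from the column names and not something this notebook confirms.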