# Data Analysis of Canva Data
Here, we will be cleaning, transforming, and modelling the Canva data.

In [89]:
import pandas as pd
import matplotlib as mpl
import numpy as np
from IPython.display import display, HTML
import pycountry

mau_df = pd.read_csv("mau_by_plan_type.csv")
device_tiers_df = pd.read_csv("device_tiers.csv")
canva_templates_df = pd.read_csv("canva_templates.csv")
mau_df.head()


|   | MONTH_END_DATE | PRIMARY_PLAN_TYPE | COUNTRY_CODE | Monthly Active Users |
|---|---|---|---|---|
| 0 | 2020-01-31 | Canva Pro | AD | 144 |
| 1 | 2020-01-31 | Canva Pro | AD | 92 |
| 2 | 2020-01-31 | Canva Pro - NFP | AD | 1 |
| 3 | 2020-01-31 | Education | AD | 4 |
| 4 | 2020-01-31 | Free | AD | 1161 |


## Stage 1: Cleaning
### 1. MAUs: Duplicate Average, Removing Unknowns
We noticed several duplicate rows in the MAU dataset with differing MAU numbers (for example, the two Canva Pro rows for AD on 2020-01-31 shown above, with 144 and 92). Since we cannot tell which entry is correct, and the duplicates do not appear to be separate observations that should be summed, we simply average entries that share the same Month + Plan Type + Country Code. We also drop rows whose plan type is Unknown.
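As a small illustration of the averaging step, the sketch below uses a toy frame that mirrors a few rows from the `mau_df.head()` output; `sample`, `keys`, `dupes`, and `averaged` are names introduced here for illustration only. It first lists the duplicated key combinations, then averages them and rounds up to whole users.

```python
import numpy as np
import pandas as pd

# Toy data mirroring rows 0, 1 and 4 of the MAU preview (illustration only).
sample = pd.DataFrame({
    "MONTH_END_DATE": ["2020-01-31"] * 3,
    "PRIMARY_PLAN_TYPE": ["Canva Pro", "Canva Pro", "Free"],
    "COUNTRY_CODE": ["AD", "AD", "AD"],
    "Monthly Active Users": [144, 92, 1161],
})
keys = ["MONTH_END_DATE", "PRIMARY_PLAN_TYPE", "COUNTRY_CODE"]

# keep=False flags every row in a duplicate group, not only the repeats.
dupes = sample[sample.duplicated(subset=keys, keep=False)]

# Average within each key combination, then round up to whole users.
averaged = sample.groupby(keys, as_index=False)["Monthly Active Users"].mean()
averaged["Monthly Active Users"] = np.ceil(averaged["Monthly Active Users"]).astype(int)

print(dupes)
print(averaged)
```

Inspecting `dupes` before averaging is a cheap way to confirm that the duplicates really do differ only in their MAU value.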

In [91]:
def code_to_name(code):
    """Return the country name for a country code, or None if it cannot be resolved."""
    try:
        return pycountry.countries.lookup(code).name
    except Exception:  # unknown or malformed code
        return None

# average duplicate rows that share Month + Plan Type + Country Code
mau_df = (
    mau_df
      .groupby(['MONTH_END_DATE', 'PRIMARY_PLAN_TYPE', 'COUNTRY_CODE'], as_index=False)
      ['Monthly Active Users']
      .mean()
)

# round the averaged values up to whole users and cast to int
mau_df['Monthly Active Users'] = np.ceil(mau_df['Monthly Active Users']).astype(int)

# convert country code to country name (None where the code is not recognised)
mau_df['COUNTRY_CODE'] = mau_df['COUNTRY_CODE'].apply(code_to_name)

# rename COUNTRY_CODE to Country for consistency between tables (for joining)
mau_df = mau_df.rename(columns={'COUNTRY_CODE': 'Country'})

# drop rows with an Unknown plan type, sort, then reset the index so it is sequential
mau_df = (
    mau_df[mau_df["PRIMARY_PLAN_TYPE"] != "Unknown"]
      .sort_values(by=['Country', 'MONTH_END_DATE', 'PRIMARY_PLAN_TYPE'])
      .reset_index(drop=True)
)
mau_df.head(10)

KeyError: 'COUNTRY_CODE'
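
The `KeyError: 'COUNTRY_CODE'` above is what pandas raises when a groupby key is not among the frame's columns. One plausible cause, assuming this cell was executed more than once in the same session, is that an earlier run had already renamed `COUNTRY_CODE` to `Country` in `mau_df`. A minimal sketch of a re-runnable variant follows. `clean_mau` and the small `CODE_TO_NAME` dictionary are hypothetical stand-ins (the notebook itself uses `pycountry` for the lookup), and the toy values are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table standing in for pycountry in this sketch.
CODE_TO_NAME = {"AD": "Andorra", "AL": "Albania"}
KEYS = ["MONTH_END_DATE", "PRIMARY_PLAN_TYPE", "COUNTRY_CODE"]

def clean_mau(raw_df):
    """Return a cleaned copy of raw_df; the input frame is never modified."""
    out = raw_df.groupby(KEYS, as_index=False)["Monthly Active Users"].mean()
    out["Monthly Active Users"] = np.ceil(out["Monthly Active Users"]).astype(int)
    out["COUNTRY_CODE"] = out["COUNTRY_CODE"].map(CODE_TO_NAME)
    out = out.rename(columns={"COUNTRY_CODE": "Country"})
    out = out[out["PRIMARY_PLAN_TYPE"] != "Unknown"]
    return (
        out.sort_values(["Country", "MONTH_END_DATE", "PRIMARY_PLAN_TYPE"])
           .reset_index(drop=True)
    )

# Toy input (values for illustration only).
raw = pd.DataFrame({
    "MONTH_END_DATE": ["2020-01-31"] * 4,
    "PRIMARY_PLAN_TYPE": ["Canva Pro", "Canva Pro", "Unknown", "Free"],
    "COUNTRY_CODE": ["AD", "AD", "AD", "AL"],
    "Monthly Active Users": [144, 92, 7, 10],
})

first = clean_mau(raw)
second = clean_mau(raw)  # repeating the call is safe: raw still has COUNTRY_CODE
print(first)
```

Keeping the raw frame under its own name and assigning the cleaned result to a new variable means the cell can be run any number of times without depending on earlier state.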

### 2. Device Tiers: Matching Table Headers
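In the Device Tiers table, countries are identified by name rather than by code. To keep the tables consistent, we define a `name_to_code` helper that converts names to codes. Note that the cell below only defines the helper and does not apply it, so the `Country` column still holds names in the output.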

In [None]:
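def name_to_code(name):
    """Return the alpha-2 code for a country name, or None if it cannot be resolved."""
    try:
        return pycountry.countries.lookup(name).alpha_2
    except Exception:  # unknown or malformed name
        return None


# sort by country; because the sort happens after reset_index, the index is not sequential
device_tiers_df = device_tiers_df.reset_index(drop=True).sort_values(by="Country")
# cast tier counts to nullable integers so missing values stay as <NA>
device_tiers_df[['High', 'Mid', 'Low', 'Unknown']] = device_tiers_df[['High', 'Mid', 'Low', 'Unknown']].astype('Int64')
device_tiers_df.head()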


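|   | Country | High | Mid | Low | Unknown | Total |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 6491 | 6277 | 1948 | 554 | 15270 |
| 2 | Albania | 23814 | 3914 | 432 | 1958 | 30118 |
| 3 | Algeria | 85198 | 76059 | 29307 | 6684 | 197248 |
| 4 | American Samoa | 407 | 115 | 9 | 94 | 625 |
| 5 | Andorra | 3565 | 694 | 33 | 588 | 4880 |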
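
Before joining the two tables on `Country`, it can help to check which countries appear in only one of them, since a name that failed to resolve or is spelled differently would silently drop out of an inner join. The sketch below is one way to do that with an outer merge and `indicator=True`. `mau_sample` and `tiers_sample` are hypothetical stand-ins for the cleaned tables, and the MAU numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned tables (MAU values are invented).
mau_sample = pd.DataFrame({
    "Country": ["Andorra", "Albania", "Afghanistan"],
    "Monthly Active Users": [118, 10, 5],
})
tiers_sample = pd.DataFrame({
    "Country": ["Andorra", "Albania", "Algeria"],
    "Total": [4880, 30118, 197248],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
check = mau_sample.merge(tiers_sample, on="Country", how="outer", indicator=True)

# Rows present in only one of the two tables.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])
```

Any `left_only` or `right_only` rows point to countries that need a naming fix, or a deliberate decision to exclude them, before the join.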
