# Initiate Setup

Load necessary packages.

In [None]:
import pandas as pd # for data transformations
import sklearn # for some general machine learning and other transformations

from langdetect import detect # for language detection

I'm using YouTube video data that I extracted for one of my old university projects; it contains video metadata and basic stats. The data was collected through the [YouTube Data API](https://developers.google.com/youtube/v3/getting-started).

In [None]:
df = pd.read_csv('video_data.csv')

In [None]:
df.head()

# Introduction

## Reductive crash course on data types in structured data

What are categorical variables?

Structured data can be thought of as two types: numeric (integers, floats) and categorical (strings, booleans*). Categorical variables are either nominal (unordered, e.g. colours: red, blue, yellow) or ordinal (ordered, e.g. good, better, best). Some variables, like days of the week, can be represented as either a category or a number, depending on the application (i.e. whether you're using the day as a label or using the difference in time between days).

Mathematical operations (+ - / *) don't make much sense with nominal categorical variables. How do we add or subtract cats and dogs? 

Ordinal variables can accept mathematical operations, __but__ we need to consider whether the gaps between levels are fixed intervals or subjective ones. For example: how much better is __better__ than __good__?

*Note: booleans can also take part in math, as they can be treated as 1 if true and 0 if false.
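
To make the nominal/ordinal distinction concrete, here is a minimal sketch on made-up values (the `colours` and `quality` series are illustrative, not part of the video dataset). It assumes pandas' `Categorical` type with an explicitly declared order for the ordinal case.

```python
import pandas as pd

# Nominal: no declared order, so only equality checks are meaningful
colours = pd.Series(pd.Categorical(["red", "blue", "yellow", "blue"]))

# Ordinal: the order is declared explicitly, so comparisons are allowed
quality = pd.Series(
    pd.Categorical(
        ["good", "best", "better", "good"],
        categories=["good", "better", "best"],
        ordered=True,
    )
)

# Element-wise comparison that relies on the declared order
at_least_better = quality >= "better"

# Booleans behave as 1/0, so summing them counts the True values
n_at_least_better = int(at_least_better.sum())

# Integer positions of each level: they encode rank, not a measured distance
ranks = quality.cat.codes.tolist()

print(n_at_least_better, ranks)
```

The integer codes tell us that "best" ranks above "better", but not by how much, which is exactly the caveat about subjective intervals.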

## Why does this matter?

A lot of machine learning techniques (e.g. linear regression, support vector machines, and neural networks) use mathematical operations to generate predictions. Feeding them categories as if they were numbers sometimes yields fairly good results; however, _sometimes_ is unreliable.

Since math doesn't fully make sense when we perform it on certain categorical variables, we need to consider ways to either:
- transform categoricals into numerics, or
- use techniques that can use categoricals directly
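
As a rough illustration of the first option, the sketch below turns categoricals into numerics on a made-up frame. The column names and values are hypothetical, and one-hot encoding and rank mapping are just two common choices I picked for the example, not necessarily the ones used later.

```python
import pandas as pd

# Hypothetical toy data, not the video dataset
toy = pd.DataFrame(
    {"pet": ["cat", "dog", "cat"], "rating": ["good", "best", "better"]}
)

# Nominal -> one indicator column per category (one-hot encoding)
one_hot = pd.get_dummies(toy["pet"], prefix="pet", dtype=int)

# Ordinal -> integer rank; the mapping encodes order only, not real distances
rating_rank = toy["rating"].map({"good": 0, "better": 1, "best": 2})

print(one_hot)
print(rating_rank.tolist())
```

One-hot encoding avoids implying any order between cats and dogs, while the rank mapping keeps the order of the ratings at the cost of assuming equal spacing between them.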

# Looking at the data

## Categories mistaken for numeric

We can see that some columns are already marked as potential categorical variables through the __object__ type, but some categories, like __category_id__, are stored as an __int__ because they look like numbers.

In [None]:
df.dtypes

We can cast category_id to a category dtype or a string dtype depending on what we want to do. For now, I just want to see it as a string.

In [None]:
df['category_id'] = df['category_id'].astype('string')

In [None]:
df.dtypes
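
For comparison, here is a small sketch of the two casting options side by side. It uses a made-up `toy` frame with hypothetical id values rather than the real data, so it can be run on its own.

```python
import pandas as pd

# Hypothetical ids standing in for the real category_id column
toy = pd.DataFrame({"category_id": [10, 22, 10, 24]})

# Option 1: string dtype, which treats each id as plain text
as_string = toy["category_id"].astype("string")

# Option 2: category dtype, which stores each distinct id once plus integer codes
as_category = toy["category_id"].astype("category")

print(as_string.dtype)
print(as_category.dtype)
print(as_category.cat.categories.tolist())  # the distinct ids pandas found
```

Either cast stops pandas from treating the ids as quantities. The category dtype additionally exposes the set of distinct levels through the `.cat` accessor.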

## Unique strings

I find checking unique values to be a useful exploratory checkpoint for categorical variables, because it surfaces columns where almost every value is distinct, which makes it hard for the machine to learn meaningful patterns. Personally, if more than 80% of a column's values are unique across the whole dataset, I find it suspicious.
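
A minimal sketch of that check, assuming a made-up frame (the `video_id` and `category_id` values here are hypothetical) and using the 80% figure as a personal rule of thumb rather than a standard cutoff:

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns
toy = pd.DataFrame(
    {
        "video_id": ["a1", "b2", "c3", "d4", "e5"],
        "category_id": ["10", "22", "10", "10", "22"],
    }
)

# Share of distinct values per column (1.0 means every row is different)
unique_ratio = toy.nunique() / len(toy)

# Flag columns above the rule-of-thumb threshold
suspicious = unique_ratio[unique_ratio > 0.8].index.tolist()

print(unique_ratio)
print(suspicious)
```

On the real data, the same two lines could be run on something like `df.select_dtypes(include=["object", "string"])` to restrict the check to the categorical columns.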

## Where's the correlation?

Usually, for numeric variables, we can draw correlation matrices showing which variables are highly correlated with each other, which can indicate dependencies. Because we're dealing with categorical variables, the same analysis isn't available to us. However, we can use the [Predictive Power Score (PPS)](https://pypi.org/project/ppscore), which aims to answer similar questions.

One way to do this is to compute the PPS for every pair of features, but for this exercise, we can simply remove dependencies and redundancies based on what the columns mean.