# Intro to EDA (Exploratory Data Analysis)

Since this is the first section on EDA, we should talk quickly about the process itself. 

```TBD```

In [None]:
import pandas

# Intro to Variables

## Variable Types

The columns represent the variables in a dataframe or table. 

**2 Types of Variables**
1. Categorical
2. Numerical

In [None]:
# load and show the data
netflix = pandas.read_csv('netflix_kaggle_movies.csv')
netflix.head()

Let's break down the variables. 

1. **id**: This is an id or label. Even though numbers are used to represent the labels, these are **categorical** variables.
2. **type**: This is a category... so it's a **categorical** variable.
3. **title**: This is a **categorical** variable.
4. **country**: This is a **categorical** variable.
5. **release_year**: This is a **numerical** variable.
6. **rating**: This is a **categorical** variable.
7. **duration**: This is a **numerical** variable.

Note: Years can be challenging depending on the context. In this context we can think of it as something we "count", which will become more clear as we move on. 

### Categorical Variables (a.k.a. Qualitative Variables)

There are three types of categorical variables. 

1. **nominal variables**: which describe something
2. **ordinal variables**: which rank or order something
3. **binary variables**: which have two (*and only two*) possible variations.

Ordinal variables are often the most confusing, because ordering and ranking have an obvious numeric context. However, that numeric context is *incalculable*: the ranks tell us the order, not the distance between positions. For instance, if we are analyzing Olympians on the podium, we don't generally average their places. (This gets even more complicated, because we do calculate the average position someone places across multiple races. That is a whole different ball of yarn.) 
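
As a small preview of pandas' ordered categorical type (introduced later in this section), the sketch below shows what "ordered but incalculable" looks like in practice: comparisons and min/max work, while a mean is refused. The ```podium``` data and the medal order are hypothetical.

```python
import pandas

# Hypothetical podium results; the medal order is an assumption for illustration.
podium = pandas.Series(
    pandas.Categorical(
        ['silver', 'gold', 'bronze', 'gold'],
        categories=['bronze', 'silver', 'gold'],
        ordered=True,
    )
)

# Ordering questions are fine for an ordinal variable.
best = podium.max()
at_least_silver = (podium >= 'silver').sum()

# Arithmetic is not: pandas raises a TypeError for the mean of a categorical.
try:
    podium.mean()
    mean_allowed = True
except TypeError:
    mean_allowed = False
```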

In [None]:
# Revisit Netflix Again
netflix.head()

Let's revisit the list of variables again. This time we're going to further categorize the variables based on their sub category of categorical variables. 

| **Variable Name** | **Variable Type** | **Sub Type** |
|-|-|-|
| id | categorical | nominal* |
| type | categorical | nominal | 
| title | categorical | nominal | 
| country | categorical | nominal | 
| release_year | numerical | TBD | 
| rating | categorical | ordinal | 
| duration | numerical | TBD | 

```
NOTE: id is ambiguous right now. The easiest example I can think of is database IDs. Some auto-generated database IDs are deliberately sequential. This would make the id an ordinal variable. However, other IDs are UUIDs, which aren't sequential, making them nominal.

Think about the use case. For time series data, we'd rank observations/rows/records based on the time of creation. However, depending on how precisely time can be measured, and on whether the records are captured across a distributed system, timestamps aren't guaranteed to be unique, so an ID might be required in data stores where every record must be unique.

Sometimes it's important to distinguish between insertion order and actual occurrence. This can also be established by another timestamp, but an auto-generated sequential id would serve the same purpose. (There are arguments that this violates single responsibility.)

As a rule of thumb, I prefer to build databases where the ID is a unique identifier (therefore a nominal variable). This is a generally accepted sensible default.  However, we must always plan for someone making bad, ignorant or rushed decisions! Never assume the ID isn't moonlighting as an ordinal variable. 
```



In [None]:
# Binary for type?? Let's prove it using some of the aggregations we've learned.
netflix.type.nunique()
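
A count of two distinct values is what we expect from a binary variable, but it is worth looking at the values themselves as well. A minimal sketch on a hypothetical ```type_demo``` column (the labels are assumptions, not taken from the data set):

```python
import pandas

# Hypothetical labels standing in for the real `type` column.
type_demo = pandas.Series(['Movie', 'TV Show', 'Movie', 'Movie'])

# nunique() counts distinct non-missing values; exactly 2 suggests a binary variable.
is_binary = type_demo.nunique() == 2

# value_counts() shows which two values they are and how often each occurs.
counts = type_demo.value_counts()
```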

### Numerical (a.k.a. Quantitative) Variables

*Grace Hopper* said, "One accurate measurement is worth a thousand expert opinions."

**2 Types of Numerical Variables**
1. **continuous**: (*measurements*). There must be infinitely smaller units of measurement between one unit and the next. Time, distance. 
2. **discrete**: (*counts*). These are "whole" things we count. People, puppies, peanut butter cups.

Let's revisit that netflix data... again. 

In [None]:
netflix.head()

Let's revisit the list of variables again. This time we're going to finish the table by filling in the sub types of the numerical variables. 

| **Variable Name** | **Variable Type** | **Sub Type** |
|-|-|-|
| id | categorical | nominal* |
| type | categorical | nominal | 
| title | categorical | nominal | 
| country | categorical | nominal | 
| release_year | numerical | discrete | 
| rating | categorical | ordinal | 
| duration | numerical | continuous | 

Ok. **Duration** is more obvious, because it is clearly subdivided. **Release_year** is time too, though, right? Yes, however, it was captured as a counting variable (whole numbers), not as continuous time. 

This is important because time and distance can be confusing. Release year isn't being used as a continuous measurement here. For example, 2000 is a "whole number" year, whereas 10/15/2020 11:45 is a continuous time value (it could be stored as a Unix epoch timestamp or something along those lines). 

Think of it this way. Are you "counting it"? Then it's discrete. Are you "measuring it"? Then it's continuous. 
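
To make the counting-versus-measuring distinction concrete, here is a rough sketch. The release year is a whole number, while a timestamp can be expressed as seconds since the Unix epoch and subdivided as finely as the clock allows. The specific date and the UTC time zone are assumptions for illustration.

```python
import pandas

# Discrete: a release year stored as a whole number.
release_year = 2000

# Continuous: a point in time, here with an assumed UTC time zone.
moment = pandas.Timestamp('2020-10-15 11:45:00', tz='UTC')
seconds_since_epoch = moment.timestamp()

# Half a second later is still a valid, distinct measurement.
half_second_later = moment + pandas.Timedelta(milliseconds=500)
later_seconds = half_second_later.timestamp()
```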

### Playing with dtypes (Numerical)

dtypes is a property of a pandas DataFrame that displays all of the data types. Let's see how we did in our table...

- **continuous** variables are usually represented as ```float``` data types.
- **discrete** variables are usually represented as ```int``` data types.

```No, int32/int64 doesn't matter. Size has more to do w/ storage.```
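
As a toy illustration of that mapping (the column names and values here are made up), pandas infers a ```float``` type for the measured column and an ```int``` type for the counted one:

```python
import pandas

# Hypothetical data: one measured column, one counted column.
toy = pandas.DataFrame({
    'duration_hours': [1.5, 2.25, 0.75],   # measured, so continuous
    'episode_count': [10, 24, 8],          # counted, so discrete
})

# dtypes reports one data type per column.
toy_dtypes = toy.dtypes

# select_dtypes can pull out columns by type, e.g. only the float64 ones.
float_columns = list(toy.select_dtypes(include='float64').columns)
```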

In [None]:
netflix.dtypes

Not bad!

**id** shows up as an ```int```, which we don't hate because it is! It's a discrete number, it just happens to be a label. 

**release_year** is an ```object```. I'd rather see that as an ```int``` (maybe). 

**duration** is a ```float```, which we expected. 

Let's clean this up and run it again. I'd like to set **id** to a ```string```, and **release_year** to an ```int```

In [None]:
# convert id to a string
netflix.id = netflix.id.astype('string')
netflix.dtypes

In [None]:
# convert year to int32
netflix.release_year = netflix.release_year.astype("int32")
netflix.dtypes

In [None]:
# Hmm.. that caused an error... wonder why??? 
netflix.release_year.unique()

Aha!! 

We've run into our first numerical problemo. 'missing' isn't an int, so we need to replace this in order to properly categorize the data. We can't perform calculations on 'missing'. 

For now, let's 
1. replace missing w/ 0
2. re-run the previous code to categorize the data
3. check the ```dtypes```

In [None]:
# NOTE: for NaNs, we would assign the result of fillna() back to the column:
# netflix.release_year = netflix.release_year.fillna(0)
#
# Since this is a string 'missing', we can replace it like this:
netflix.release_year = netflix.release_year.replace('missing',0)
netflix.release_year = netflix.release_year.astype("int32")
netflix.dtypes
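
Replacing 'missing' with 0 keeps things moving, but it does plant a year 0 in the data. As an alternative sketch (not what this notebook does), ```pandas.to_numeric``` with ```errors='coerce'``` turns anything non-numeric into a missing value, and the nullable ```Int64``` dtype keeps the column integer-typed. The ```years_demo``` series is hypothetical.

```python
import pandas

# Hypothetical column mixing year strings with a 'missing' marker.
years_demo = pandas.Series(['2019', '2020', 'missing', '2018'])

# errors='coerce' converts unparseable entries to NaN instead of raising.
years_numeric = pandas.to_numeric(years_demo, errors='coerce')

# The nullable Int64 dtype (capital I) stores integers alongside missing values.
years_nullable = years_numeric.astype('Int64')

# Missing entries are skipped by aggregations rather than counted as 0.
earliest = years_nullable.min()
```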

### Playing with dtypes (categorical)

This continues what we did above, but w/ categorical variables. 

**nominal** variables are usually ```object``` or ```string``` data types. The ```object``` type in pandas is more of a "junk type" than anything else. It is reserved for everything that it can't identify as ```intXX```, ```floatXX```, or ```bool```. Pandas doesn't automatically identify anything as a string. You have to do it explicitly. The primary reason for using ```string``` is to perform string operations like ```lower()```. 

**ordinal** variables are usually incorrectly encoded as ```int``` by pandas, but they are typically ```object```. 

**binary** variables can be represented as bool, but are often encoded as ```object``` or ```int```. 

Let's go through them one at a time. 

In [None]:
# I've decided I don't want id to be a string...because we can't perform most string operations on a number!
netflix.id = netflix.id.astype('object')
netflix.dtypes

In [None]:
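# We defined type as a binary variable earlier, so let's set it to a bool.
# astype('bool') would mark every non-empty string as True, so compare against one label instead.
# ASSUMPTION: 'Movie' is one of the two labels; confirm with netflix.type.unique().
netflix.type = netflix.type == 'Movie'
netflix.dtypes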
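
One caveat worth checking when converting text labels to ```bool```: Python treats every non-empty string as truthy, so a plain cast on a column of labels tends to produce all ```True```. A hedged sketch with hypothetical labels, contrasting the cast with an explicit comparison:

```python
import pandas

# Hypothetical labels standing in for the real `type` column (object dtype assumed).
type_demo = pandas.Series(['Movie', 'TV Show', 'Movie'], dtype='object')

# Casting non-empty strings to bool yields True for every row.
cast_result = type_demo.astype('bool')

# An explicit comparison keeps the information; 'Movie' is an assumed label.
is_movie = type_demo == 'Movie'
```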

In [None]:
# Title... hm. That really is a string. We might want to do string-y things to it. 
netflix.title = netflix.title.astype('string')
netflix.dtypes
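
As an example of "string-y things", the ```.str``` accessor applies string methods to every value in a column. The titles below are made up for illustration.

```python
import pandas

# Hypothetical titles with inconsistent casing and stray whitespace.
titles_demo = pandas.Series(
    ['  The Big Heist', 'SPACE DOGS ', 'the quiet lake'], dtype='string'
)

# .str applies a string method to every element.
cleaned = titles_demo.str.strip().str.lower()

# Boolean masks from string tests are handy for filtering.
starts_with_the = cleaned.str.startswith('the')
```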

Country and Rating are both strings, but I'm going to leave them alone for now. 
In this case, both Rating and Country represent what would probably be an 'enum' in a codebase, because
there is a finite set of potential values, and I'd like those values to be consistent. 

### Pandas 'Category' Data Type

One of the limitations of these data types is that they can't do things like tell us about order or sequencing. Pandas has a Categorical data type specifically for this. Let's look at the ```rating``` variable in more detail. (Remember.. we said this was **ordinal**)


In [None]:
# what are the unique values of rating?
netflix.rating.unique()

Ugh. That's a mess. 

Upon **inspection** of the data, we've got to do some cleaning. 
Before we get to that step, let's think about the sources. 

If the data is external, at best we can notify the stewards, although they may not actually be the owners, so the odds of it changing are low. (It's worth noting that the problem exists if you plan to re-use this data regularly.)

If the data is internal, we need to note the source so that we can fix it and streamline the process in the future. 

*in this case we know the data is external*

**Step 1: ID the problem**
The problem, in this case, is that we have 3 different values that represent the same meaning: 
- nan
- UNRATED
- NOT RATED

We'd have to validate that as best as we can. It's entirely possible that there is a degree of nuance. nan could be a value that was unintentionally left out, UNRATED could be a movie that was never evaluated for a rating (potentially due to when it was released), and NOT RATED could be a movie that was explicitly not given a rating. 

For right now, we're going to say that we don't care why it wasn't rated. All we care about is that it wasn't. 
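
One way to start validating is simply to count how often each variant appears, missing values included. A sketch on hypothetical ratings:

```python
import pandas

# Hypothetical ratings containing the three "not rated" variants.
ratings_demo = pandas.Series(['PG', None, 'UNRATED', 'NOT RATED', 'R', 'UNRATED'])

# dropna=False keeps missing values in the tally instead of silently dropping them.
variant_counts = ratings_demo.value_counts(dropna=False)

# isna() flags the rows where no rating was recorded at all.
missing_rows = ratings_demo.isna().sum()
```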

**Step 2: handle the discrepancy**
We need to make a decision about how these values are going to be represented. A common representation is 'NR'. 

**Step 3: clean!!!**

In [None]:
# fillna() is the best way to resolve 'nan's
netflix.rating = netflix.rating.fillna('NR')
netflix.rating.unique()

In [None]:
# You might remember the replace() method from the Python Pandas courses. 
## When we are doing a 1:1 find/replace, the syntax is replace(from_value, to_value)
## doing a many:1 replace is a bit more tricky
netflix.rating = netflix.rating.replace(
    dict.fromkeys(['UNRATED','NOT RATED'], 'NR')
)
netflix.rating.unique()
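
For reference, ```dict.fromkeys``` just builds a mapping in which every listed key points at the same value. ```replace()``` also accepts a list of values to find together with a single replacement, which can read a little more directly. A small sketch on hypothetical data:

```python
import pandas

# dict.fromkeys maps each key to the same value.
mapping = dict.fromkeys(['UNRATED', 'NOT RATED'], 'NR')

# Hypothetical ratings for illustration.
ratings_demo = pandas.Series(['PG', 'UNRATED', 'NOT RATED', 'NR'])

# Many-to-one replace using the mapping.
via_dict = ratings_demo.replace(mapping)

# Equivalent many-to-one replace using a list and a single target value.
via_list = ratings_demo.replace(['UNRATED', 'NOT RATED'], 'NR')
```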

Cool. We're in much better shape now. 

--- 

Before we get to that Categorical data type, I want to take a look at TV-14 vs. PG-13. 

Our data set includes ratings for both TV and Movies. We have a few options about how to handle this. 

The complex way is to manage two rating systems. This means that the rating system is conditional based on the ```type``` variable. This isn't uncommon, but it's naturally more complex. 

**Why** would you do this? 
In many cases, categorization through a binary/bool data type is purposely going to result in analysis of different types or categories. If our job was to compare the ratings systems (TV vs. MPA), then having two different categories for rating would make sense, because we'd likely be looking for comparisons between the two (based on some other confounding variable more than likely). In this case the nuances of the two ratings systems matter. 

Alternatively, we might be looking at something else, and rating is more of a categorical generalization. 
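
Either way, it helps to see which ratings actually occur for each ```type``` before deciding. A cross-tabulation is one quick way to do that; the rows below are hypothetical.

```python
import pandas

# Hypothetical rows mixing MPA and TV ratings.
demo = pandas.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'rating': ['PG-13', 'R', 'TV-14', 'PG-13', 'TV-14'],
})

# Rows are types, columns are ratings, cells are counts.
rating_by_type = pandas.crosstab(demo['type'], demo['rating'])

# Ratings that appear for TV shows in this toy data.
tv_counts = rating_by_type.loc['TV Show']
tv_ratings = list(tv_counts[tv_counts > 0].index)
```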

There are two very powerful tools that help us become more decisive. 
1. If you have the raw data, you have the ability to be wrong. This is my favorite tool. I love being wrong. 
2. **I don't know**. If you aren't sure, just make an educated guess, make a call and move forward.

If you are wrong, you're going to find out quickly. Don't fight against that. Let it happen, and enjoy it. 

---

Given that sidebar, we don't really know what we're doing with this data yet. Looking at the rest of the ratings, it seems that TV-14 is the only TV rating in the bunch. The rest are all MPA ratings. As a result, to remain consistent, let's find the closest MPA rating, which happens to be PG-13. Let's swap the values. 

In [None]:
netflix.rating = netflix.rating.replace('TV-14','PG-13')
netflix.rating.unique()

(Yes, cleaning and data analysis should be this tedious. Being open and eager to be wrong quickly is how we ultimately end up w/ results that are **right**.) 

So now let's create a Categorical variable and order the ratings...

Oops, what's the order? More accurately, where do we put 'NR'? There are a number of folks who think that NR should be put first. I personally think that's a bad idea, because the 'lowest' rating (G) is based on the appropriateness of the content for all audiences. If we put NR below G we'd be suggesting that content that hasn't been evaluated is appropriate for children. Instead, let's put it at the other end. 

```
G, PG, PG-13, R, NR
```

It's critical to justify ordering in a clear (and intuitive) manner. When it comes to governance and data dictionaries, we want to use as few words as possible. The heavier our data governance is, the more likely the users get bogged down with the experience. 

In [None]:
# Finally! Categorical!
netflix.rating = pandas.Categorical(
    netflix.rating,
    categories=['G','PG','PG-13','R','NR'],
    ordered=True
)

# Now what does our data look like...
netflix.rating.unique()
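
With the order in place, comparisons and sorting follow the rating scale rather than the alphabet. A minimal sketch on hypothetical ratings, using the same order as above:

```python
import pandas

rating_order = ['G', 'PG', 'PG-13', 'R', 'NR']

# Hypothetical ratings; any value outside rating_order would become NaN.
ratings_demo = pandas.Series(
    pandas.Categorical(['R', 'G', 'NR', 'PG-13', 'PG'],
                       categories=rating_order, ordered=True)
)

# Sorting respects the declared order, not alphabetical order.
sorted_ratings = ratings_demo.sort_values().tolist()

# Comparisons against a category work too.
pg13_or_stricter = (ratings_demo >= 'PG-13').tolist()
```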

### OHE (One-Hot Encoding)
This is a clever technique for making comparisons. We are essentially decomposing a categorical variable into a new binary variable for each of the categories of the original variable. 

If you consider our ratings, we end up with:

- G, NOT_G
- PG, NOT_PG
- PG-13, NOT_PG-13
- R, NOT_R

We might not need NR anymore as it would represent a case where a movie is NOT_X for all 4 OHE variables. It ultimately depends on whether or not we want to provide a comparison for NR. (This provides context on *implicit* information vs. *explicit* information.)

**A few important notes:**
- OHE discards ordering, so keep that in mind. OHE is more popular for nominal values, but it can still be used for ordinals, *especially when we don't want to assume there is equal spacing between the categories* (e.g. a Likert scale)
- we use a method called ```get_dummies()``` in pandas. (the dummy is the OHE variable). It creates an entirely new dataframe w/ a different set of variables to the original dataframe.
- OHE works best for variables w/ a small number of categories, especially for **ML Modeling**. Adding variables increases dimensionality of the dataframe, which creates problems w/ certain ML modeling techniques. (It can make it run slower, and add noise to the calculations)

So, it would make sense to use it on ```rating```, where the distance between each rating is ambiguous. It wouldn't make sense to apply it to ```country```, because there are 200+ countries in the world (probably not as many making movies, but I digress). 

In [None]:
# OHE the ratings category
netflix_ratings_ohe = pandas.get_dummies(data=netflix,columns=['rating'])
netflix_ratings_ohe                                                        
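
To see what ```get_dummies()``` produces on a small scale, here is a sketch on hypothetical ratings. It also shows the implicit-versus-explicit idea: after dropping the NR indicator, an unrated title is the row where every remaining indicator is off.

```python
import pandas

# Hypothetical ratings stored as plain strings.
demo = pandas.DataFrame({'rating': ['G', 'PG', 'R', 'NR']})

# One indicator column per category, named <column>_<category>.
ohe = pandas.get_dummies(demo, columns=['rating'])
ohe_columns = sorted(ohe.columns)

# Dropping the NR indicator makes "not rated" implicit.
ohe_without_nr = ohe.drop(columns=['rating_NR'])
indicators_on = ohe_without_nr.astype(int).sum(axis=1).tolist()
```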