### Assessing Variable Types

How we collect a value determines what kind of variable we have. There are two broad ways to obtain data: measuring or counting a quantity, and describing a quality. That gives two broad types of variables: numerical and categorical.


In [None]:
import pandas as pd

In [None]:
movies = pd.read_csv("../Learn Pandas/csv/netflix_movies.csv")
movies.head()
# Rating variable type is categorical

### Categorical Variables

Categorical variables come in 3 types:

- Nominal variables, which describe something,
- Ordinal variables, which have an inherent ranking, and
- Binary variables, which have only two possible variations.

#### Nominal Variables

When we want to describe something about the world, we need a nominal variable. Nominal variables are usually words (e.g., red, yellow, blue or hot, cold), but they can also be numbers (e.g., zip codes or user IDs).

#### Ordinal Variables

When our categories have an inherent order, we need an ordinal variable. Ordinal variables are often expressed as ranks such as 1st, 2nd, and 3rd. Places in a race, grades in school, and the scales in survey responses (Likert scales) are ordinal variables.

#### Binary Variables

When there are only two logically possible variations, we need a binary variable. Binary variables are things like on/off, yes/no, and TRUE/FALSE. If there is any possibility of a third option, it is not a binary variable.
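
As a rough illustration of the three categorical types, the sketch below builds a small table with one column of each kind. The data and the column names (favorite_color, finish_place, subscribed) are hypothetical and are not taken from the Netflix dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
survey = pd.DataFrame(
    {
        "favorite_color": ["red", "blue", "red", "yellow"],  # nominal: labels with no order
        "finish_place": ["1st", "3rd", "2nd", "1st"],  # ordinal: labels with a ranking
        "subscribed": [True, False, True, True],  # binary: exactly two possible values
    }
)

# Count the distinct values that appear in each column
distinct_counts = survey.nunique()
print(distinct_counts)
```

A column with only two distinct values is not automatically binary; what matters is whether a third value is logically possible.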


In [None]:
movies["country"].unique()
# Country variable type is nominal

### Quantitative Variables

Quantitative (numerical) variables come in two types:

- Continuous (measurements)
- Discrete (counts)

Continuous variables come from measurements. For a variable to be continuous, it must be possible to divide any unit of measurement into ever smaller units, so there is always another possible value between any two values.

Discrete variables come from counting. For a variable to be discrete, there must be gaps between its possible values, so it cannot take a value between one count and the next. Counts of people, cars, and dogs are all good examples of discrete variables.
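
To make the distinction concrete, here is a minimal sketch with one measured column and one counted column. The data and the column names (distance_km, passengers) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
trips = pd.DataFrame(
    {
        "distance_km": [1.5, 12.25, 3.0],  # continuous: a measurement that can take fractional values
        "passengers": [1, 4, 2],  # discrete: a count of whole people
    }
)

# pandas infers a floating-point type for the measurement and an integer type for the count
print(trips.dtypes)
```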


In [None]:
movies.head()
# release_year variable type is discrete

### Changing Numerical Variable Data Types

- Continuous (numerical) variables should usually be stored as the **float data type** because it allows us to store decimal values.
- Discrete (numerical) variables should be stored as the **int data type** to represent mathematically that they are discrete.
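
Missing values are a common obstacle when converting a column to int, because a missing value (NaN) is itself stored as a float. The sketch below shows two possible ways around this on a hypothetical series of counts: filling the gaps before converting, or using pandas' nullable integer type "Int64" (capital I), which keeps the missing value. Which one is appropriate depends on whether 0 is a sensible stand-in for the missing data.

```python
import pandas as pd

# Hypothetical counts with one missing value; pandas stores this as float64
counts = pd.Series([3.0, None, 7.0], name="cast_count")

# Option 1: replace missing values with 0, then convert to a plain integer type
filled = counts.fillna(0).astype("int64")

# Option 2: convert to the nullable integer type, which preserves the missing value
nullable = counts.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```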


In [None]:
movies.dtypes

In [None]:
# Replace null values with 0 first; converting a column that contains nulls to int raises an error.
movies["cast_count"] = movies["cast_count"].fillna(0)
movies["cast_count"] = movies["cast_count"].astype("int64")
movies.dtypes

### Changing Categorical Variable Data Types

- Nominal variables are often represented by the object data type. They can also be represented by the string data type, but pandas usually guesses object rather than string.
- Ordinal variables should be represented as objects, but pandas often guesses int.
- Binary variables can be represented as bool, but pandas often guesses int or object.
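
As a small sketch of overriding pandas' guesses, the example below converts a text column to the string data type and a 0/1 column to bool. The data and the column names (title, is_movie) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],  # nominal text labels
        "is_movie": [1, 0, 1],  # binary variable stored as 0/1 integers
    }
)

# Inspect what pandas inferred on its own
print(df.dtypes)

# Convert each column to the data type that matches its variable type
df["title"] = df["title"].astype("string")
df["is_movie"] = df["is_movie"].astype("bool")
print(df.dtypes)
```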


In [None]:
movies["title"] = movies["title"].astype("string")
movies.dtypes

In [None]:
movies["type"] = movies["type"].astype("string")
movies.dtypes

### The Pandas Category Data Type


In [None]:
movies.head()

In [None]:
movies.rating.unique()

In [None]:
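# ordered=True records the ranking; any value not listed in the categories becomes NaN.
movies["rating"] = pd.Categorical(movies["rating"], ["NR", "G", "PG", "PG-13", "R"], ordered=True)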
movies.rating.unique()
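
The sketch below illustrates what an ordered category makes possible: sorting and comparing by rank rather than alphabetically. It uses the same category order as the cell above, but the ratings series itself is hypothetical toy data.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only
raw = pd.Series(["PG", "R", "G", "PG-13", "NR"])
order = ["NR", "G", "PG", "PG-13", "R"]

# ordered=True tells pandas that the categories are ranked in the order given
ratings = pd.Series(pd.Categorical(raw, categories=order, ordered=True))

# Sorting follows the category order, not alphabetical order
print(ratings.sort_values().tolist())

# Comparisons against a category are also rank-based
print((ratings > "G").tolist())
print(ratings.min(), ratings.max())
```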

### One-Hot Encoding
With one-hot encoding (OHE), we create a new binary variable for each of the categories within our original variable. This technique is useful when managing nominal variables because it encodes the variable without creating an order among the categories.
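
Before applying this to a real dataset, here is a minimal sketch on a hypothetical three-row table with a single nominal column named color.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
colors = pd.DataFrame({"color": ["red", "blue", "red"]})

# Each distinct category becomes its own indicator column, named <column>_<category>
encoded = pd.get_dummies(data=colors, columns=["color"])

print(encoded.columns.tolist())
print(encoded)
```

The original color column is replaced by the new indicator columns, and each row has exactly one of them set.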

In [None]:
cereal = pd.read_csv("../Learn Pandas/csv/cereal.csv")
cereal.head()

In [None]:
cereal = pd.get_dummies(data=cereal, columns=["mfr"])
cereal.head()

### Review
- .head() to explore datasets
- .dtypes to check the data type of each variable
- .astype() to change the data type of a variable
- pd.Categorical() to create the pandas category data type
- pd.get_dummies() to one-hot encode a variable