# Pandas Scales :

# 1. Ratio Scales :

**in ratio scale**:

* the measurement units **are equally spaced** and there **is a true zero**.
* the mathematical operations of **addition, subtraction, multiplication, and division** are all valid.
* a good example for ratio scales is **height and weight**.


# 2. Interval Scales :

**in interval scale** :

* the measurement units **are equally spaced** like Ratio Scales, but there **is no true zero**.
* only **addition and subtraction** are valid; multiplication and division are not meaningful because zero does not mean "none".
* good examples for Interval Scales are :
    * **temperature measurement** in Celsius and Fahrenheit. in this case, **zero degrees is a meaningful value itself**, not the absence of temperature.
    * **the direction on a compass**, where **zero degrees doesn't indicate a lack of direction**, but instead **describes a direction** itself.
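A minimal sketch of why ratios are not meaningful on an interval scale: the same two temperatures give different ratios in Celsius and Fahrenheit, while their differences stay consistent. The helper `c_to_f` and the sample temperatures are made up for illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

warm_c, cool_c = 20.0, 10.0
warm_f, cool_f = c_to_f(warm_c), c_to_f(cool_c)

# Ratios disagree between the two units, so "twice as warm" has no meaning.
ratio_c = warm_c / cool_c
ratio_f = warm_f / cool_f

# Differences are consistent: a 10 degree Celsius gap is an 18 degree Fahrenheit gap.
diff_c = warm_c - cool_c
diff_f = warm_f - cool_f

print(ratio_c, ratio_f, diff_c, diff_f)
```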

**in data mining**, it's important to **keep the distinction between the different scales** in mind **when we apply advanced statistical tests and algorithms**.

# 3. Ordinal Scales :

**in ordinal scale** :

* **the order of units (or values) is important** but the **differences between values are not equally spaced**.
* we can **create an ordinal object (or an ordered categorical object)** by **passing a list of categories and setting the ordered parameter** to __True__ in the **CategoricalDtype() class**.
* we can use certain **operations like max and min on the ordinal object**.
* good examples for ordinal scales are :
    * **letter grades** such as A-, A, A+.
    * **the human evolution process**.

**ordinal scale is very common in machine learning** and sometimes it can be a bit of a challenge to work with.
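As a small sketch of the min and max operations mentioned above, the following builds an ordered categorical from three letter grades. The variable names and the sample grades are illustrative assumptions.

```python
import pandas as pd

# Categories are listed from lowest to highest; ordered=True makes the dtype ordinal.
grade_order = pd.CategoricalDtype(categories=["A-", "A", "A+"], ordered=True)
grades = pd.Series(["A", "A+", "A-", "A"], dtype=grade_order)

# min and max follow the declared category order, not alphabetical order.
lowest = grades.min()
highest = grades.max()
print(lowest, highest)
```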

# 4. Nominal Scales (or Categorical Data) :

**in nominal scale** :

* there are **categories of data** that **have no order with respect to the other categories**.
* a good example for nominal scale is **the teams of a sport**.
* **changing their order** with a mathematical function is **meaningless**.
* **categorical values are very common**, and we generally **refer to categories** with **only 2 possible values as binary categories**.
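A short sketch of what a nominal scale allows in pandas: counting and equality checks work, while asking for a minimum raises an error because no order is defined. The team names are made up for illustration.

```python
import pandas as pd

# An unordered (nominal) categorical series of hypothetical team names.
teams = pd.Series(["Lions", "Tigers", "Lions", "Bears"], dtype="category")

# Counting and equality are meaningful on a nominal scale.
counts = teams.value_counts()
is_lions = teams == "Lions"

# Ordering operations are rejected because the categories have no order.
try:
    teams.min()
    ordering_allowed = True
except TypeError:
    ordering_allowed = False

print(counts["Lions"], ordering_allowed)
```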

In [5]:
import pandas as pd
import numpy as np

In [9]:
df = pd.DataFrame(['A+','A', 'A-', 'B+', 'B', 'B-',
                  'C+', 'C', 'C-', 'D+', 'D', 'D-'],
                 index= ['excellent', 'excellent', 'excellent',
                        'good', 'good', 'good', 'ok', 'ok', 'ok',
                        'poor', 'poor', 'poor'])
df.columns = ["grades"]
df

| | grades |
|---|---|
| excellent | A+ |
| excellent | A |
| excellent | A- |
| good | B+ |
| good | B |
| good | B- |
| ok | C+ |
| ok | C |
| ok | C- |
| poor | D+ |


If we check the datatype of this column, we see that it's just **object**, since the column holds string values.

In [22]:
df.dtypes

grades    object
dtype: object

With __.astype()__, we can **change the type of a dataframe or a series object to any type** we are interested in.

We can change **the type of the series below to the categorical type** by passing **"category"** to the **.astype()** method.

In [23]:
type(df["grades"].astype("category"))

pandas.core.series.Series

In [59]:
df["grades"].astype("category")

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
poor         D-
Name: grades, dtype: category
Categories (12, object): ['A', 'A+', 'A-', 'B', ..., 'C-', 'D', 'D+', 'D-']

We see now that there are twelve categories, and pandas is aware of what those categories are. More interesting though is that our data isn't just categorical, but that it's ordered. That is, an A- comes after a B+, and B comes before a B+.
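Note that pandas does not know about this order yet. A small sketch, using a made-up list of grades, shows that `.astype("category")` infers the categories in sorted (lexicographic) order and marks them as unordered:

```python
import pandas as pd

grades = pd.Series(["B+", "A-", "C", "A+"]).astype("category")

# Inferred categories are sorted as strings, so 'A+' sorts before 'A-'.
categories = list(grades.cat.categories)

# No order is attached unless we ask for one explicitly.
is_ordered = grades.cat.ordered

print(categories, is_ordered)
```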

We can __create a categorical object (or a nominal object)__ by **passing a list of categories to the CategoricalDtype() class**.

In [55]:
categorical_object = pd.CategoricalDtype(categories= ['D-', 'D', 'D+',
                                                   'C-', 'C', 'C+',
                                                   'B-', 'B', 'B+',
                                                   'A-', 'A', 'A+'])
print(type(categorical_object))

<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>


In [56]:
categorical_series = df["grades"].astype(categorical_object)
categorical_series

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
poor         D-
Name: grades, dtype: category
Categories (12, object): ['D-', 'D', 'D+', 'C-', ..., 'B+', 'A-', 'A', 'A+']

We can **create an ordinal object (or an ordered categorical object)** by passing __ordered=True__ to the **CategoricalDtype() class**.

In [38]:
ordinal_object = pd.CategoricalDtype(categories= ['D-', 'D', 'D+',
                                                   'C-', 'C', 'C+',
                                                   'B-', 'B', 'B+',
                                                   'A-', 'A', 'A+'],
                                          ordered= True) 
print(type(ordinal_object))

<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>


In [50]:
ordinal_series = df["grades"].astype(ordinal_object)
ordinal_series

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
poor         D-
Name: grades, dtype: category
Categories (12, object): ['D-' < 'D' < 'D+' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']

Now we see that pandas is not only aware that there are 12 categories, but it is also aware of the order of those categories.
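One way to inspect that order, sketched here with a shortened and purely illustrative grade list, is to look at the integer codes behind the categories and to sort the series:

```python
import pandas as pd

order = pd.CategoricalDtype(categories=["D", "C", "B", "A"], ordered=True)
grades = pd.Series(["B", "D", "A", "C"], dtype=order)

# Each value is stored as its position in the declared category list.
codes = grades.cat.codes.tolist()

# Sorting follows the category order (D < C < B < A), not alphabetical order.
sorted_grades = grades.sort_values().tolist()

print(codes, sorted_grades)
```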

So, **what can we do with an ordinal series or an ordinal dataframe object?** 
Well, because **there is an ordering, this can help us with comparisons and boolean masking**.

In the next two cells, **we don't have an ordered categorical dataframe or series (or an ordinal series or dataframe)**.

For instance, if we compare our original string grades to a "C", we see that the lexicographical comparison returns results we did not intend, and an unordered categorical series refuses the comparison altogether.

In [51]:
df[df["grades"]> 'C'] # series

| | grades |
|---|---|
| ok | C+ |
| ok | C- |
| poor | D+ |
| poor | D |
| poor | D- |


In [53]:
categorical_series[categorical_series > 'C'] # unordered categorical (nominal) series

TypeError: Unordered Categoricals can only compare equality or not

If we instead make the comparison on the series whose type is set to an ordered categorical, we solve this problem.

In [54]:
ordinal_series[ordinal_series > 'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: grades, dtype: category
Categories (12, object): ['D-' < 'D' < 'D+' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']
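Comparisons on an ordered categorical can also be combined into a range mask. The sketch below, with made-up grades, selects everything in the B band and counts the matches:

```python
import pandas as pd

order = pd.CategoricalDtype(
    categories=["D-", "D", "D+", "C-", "C", "C+",
                "B-", "B", "B+", "A-", "A", "A+"],
    ordered=True,
)
grades = pd.Series(["A", "C+", "B-", "D", "B+"], dtype=order)

# Combine two ordered comparisons into one boolean mask.
in_b_band = (grades >= "B-") & (grades <= "B+")

selected = grades[in_b_band].tolist()
count = int(in_b_band.sum())

print(selected, count)
```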

**Sometimes it is useful to represent categorical values** as __each__ being a **column with a true or a false** as to whether the category applies. This is especially common in **feature extraction**, which is a topic in the data mining course. 

**Variables with a boolean value** are typically called **dummy variables**, and **pandas has a built-in function** called **pd.get_dummies()** which **turns each distinct value of the column we are interested in into its own column, and fills it with zeros and ones** indicating the **presence of the dummy variable**.

In [60]:
pd.get_dummies(df["grades"])

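| | A | A+ | A- | B | B+ | B- | C | C+ | C- | D | D+ | D- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| excellent | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| excellent | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| excellent | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| good | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| good | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| good | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ok | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ok | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ok | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| poor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |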
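Depending on the pandas version, `pd.get_dummies()` may return True/False columns instead of zeros and ones (recent versions default to booleans). A minimal sketch, with made-up grades, that requests integers explicitly through the `dtype` parameter:

```python
import pandas as pd

grades = pd.Series(["A", "B", "A", "C"], name="grades")

# dtype=int asks for 0/1 columns regardless of the version's default.
dummies = pd.get_dummies(grades, dtype=int)

# Each row has exactly one 1, marking the category that applies.
row_sums = dummies.sum(axis=1).tolist()

print(dummies)
```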


There's one more common scale-based operation I'd like to talk about, and that's **converting** a scale from something that is on **the interval or ratio scale, like a numeric grade, into** one which is **categorical**. 

**where can we use this ?**

it’s commonly done in a couple of places. For instance:
* if we are **visualizing the frequencies of categories**, this can be an extremely useful approach.
* **histograms are regularly used** with converted interval or ratio data. 
* if we are using a **machine learning classification approach** on data, we need to be **using categorical data**, so **reducing dimensionality may be useful just to apply a given technique**. 

**To convert the interval or ratio scale into the categorical scale**, we can use the built-in **pd.cut()** function.

**pd.cut()** **takes** an array-like argument such as **a column of a dataframe or a series**. It also **takes a number of bins** to be used, and **all bins are kept at equal spacing**.
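Besides a bin count, `pd.cut()` also accepts explicit bin edges and labels. The sketch below turns numeric grades into letter categories; the scores and the grade boundaries are assumptions chosen only for illustration.

```python
import pandas as pd

scores = pd.Series([95, 82, 67, 74, 58])

# Explicit edges define right-closed intervals: (0, 60], (60, 70], and so on.
letters = pd.cut(
    scores,
    bins=[0, 60, 70, 80, 90, 100],
    labels=["F", "D", "C", "B", "A"],
)

print(letters.tolist())
```

The result is an ordered categorical, so comparisons such as `letters > "C"` work on it.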

Let's go back to our census data for an example. We saw that we could group by state, then aggregate to get a list of the average county size by state. If we further apply cut to this with, say, ten bins, we can see the states listed as categoricals using the average county size.


In [89]:
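df = pd.read_csv("datasets/census.csv")

# keep only county-level rows (SUMLEV == 50), dropping the state summaries
df = df[df["SUMLEV"] == 50]

# average 2010 county population per state (NaN values are skipped)
series = df.groupby("STNAME")["CENSUS2010POP"].mean()
series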

STNAME
Alabama                  71339.343284
Alaska                   24490.724138
Arizona                 426134.466667
Arkansas                 38878.906667
California              642309.586207
Colorado                 78581.187500
Connecticut             446762.125000
Delaware                299311.333333
District of Columbia    601723.000000
Florida                 280616.567164
Georgia                  60928.635220
Hawaii                  272060.200000
Idaho                    35626.863636
Illinois                125790.509804
Indiana                  70476.108696
Iowa                     30771.262626
Kansas                   27172.552381
Kentucky                 36161.391667
Louisiana                70833.937500
Maine                    83022.562500
Maryland                240564.666667
Massachusetts           467687.785714
Michigan                119080.000000
Minnesota                60964.655172
Mississippi              36186.548780
Missouri                 52077.626087
Monta

In [88]:
print(series.max())
print(series.min())

642309.5862068966
12336.060606060606


In [91]:
pd.cut(series, bins=10) # split into 10 equal-width bins (an ordered categorical)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

These are our 10 categories:

* (11706.087, 75333.413]
* (75333.413, 138330.766]
* (138330.766, 201328.118]
* (201328.118, 264325.471]
* (264325.471, 327322.823]
* (327322.823, 390320.176]
* (390320.176, 453317.529]
* (453317.529, 516314.881]
* (516314.881, 579312.234]
* (579312.234, 642309.586]

In [72]:
138330.766 - 75333.413

62997.353

In [73]:
201328.118 - 138330.766

62997.351999999984

In [74]:
264325.471 - 201328.118

62997.35300000003

In [75]:
327322.823 - 264325.471

62997.351999999955
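Instead of subtracting edges by hand, we can ask `pd.cut()` for the edges with `retbins=True` and take their differences. This is a sketch on made-up values rather than the census data. The first bin comes out slightly wider because pandas extends the lowest edge a little so that the minimum value is included, which matches the `11706.087` lower edge seen above.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, 35.0, 50.0, 90.0, 110.0])

# retbins=True returns the binned data together with the bin edges.
binned, edges = pd.cut(values, bins=4, retbins=True)

# Width of each bin: the difference between consecutive edges.
widths = np.diff(edges)

print(edges)
print(widths)
```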

Here we see that states like Alabama and Alaska fall into the same category, while California and the District of Columbia fall in a very different category.

Now, **cutting is just one way to build categories from your data**, and **there are many other methods**. 

For instance, **pd.cut() gives us interval data, where the spacing between each category is equal sized**. Sometimes we want to **form categories based on frequency**, where **the number of items in each bin is the same**, instead of the spacing between bins (or categories).

It really depends on **what the shape of our data is**, and **what we’re planning to do with it**.
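For frequency-based categories, pandas provides `pd.qcut()`, which the notes above do not name; it is used here as one possible choice. The sketch compares equal-width and equal-count binning on made-up, skewed values:

```python
import pandas as pd

# Skewed sample data: most values are small, two are very large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, 1000])

# Equal-width bins: almost everything lands in the first bin.
equal_width = pd.cut(values, bins=2)

# Equal-frequency bins: each bin holds the same number of items.
equal_count = pd.qcut(values, q=2)

# sort=False keeps the counts in bin order rather than sorted by size.
width_counts = equal_width.value_counts(sort=False).tolist()
freq_counts = equal_count.value_counts(sort=False).tolist()

print(width_counts, freq_counts)
```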