# Scales
* We're going to talk about things you probably learned in grade school but also probably don't think about much
* And of course, we're going to talk about them in Pandas!

In [1]:
# Let's look at some letter grades...
import pandas as pd
df=pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 
                       'ok', 'ok', 'ok', 'poor', 'poor'],
               columns=["Grades"])
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+
poor,D


In [2]:
# What is our series datatype?
df.dtypes

Grades    object
dtype: object

* That seems pretty broad, eh? "object" pretty much means anything...
* We know more here. We have clear categories that have meaning to us as people. We can put this meaning into pandas `DataFrame` objects

In [3]:
# We can use the astype() method to tell pandas to treat this column as categorical
df["Grades"].astype("category").head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

* Notice that there are now 11 categories!
* But actually, our data isn't really categorical, is it? What else do we know about this data?
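* Before moving on, here's a quick check that pandas sees these categories as unordered labels. This is a minimal sketch on a standalone Series (the names `letter_grades` and `as_cat` are just for illustration), using the `.cat` accessor

```python
import pandas as pd

letter_grades = pd.Series(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'])
as_cat = letter_grades.astype("category")

# The distinct labels pandas found, sorted as plain strings rather than by grade
print(list(as_cat.cat.categories))

# False: pandas has no idea yet that A+ outranks A
print(as_cat.cat.ordered)

# Each value is stored as a small integer code pointing into the categories
print(as_cat.cat.codes.tolist())
```

* The categories come back in string sort order (`A`, `A+`, `A-`, ...), which is not the order a teacher would use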

In [4]:
# We can tell pandas that the data is ordered by first creating our own data type
my_categories=pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                  ordered=True)
# then we just pass this to the astype() function
grades=df["Grades"].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [5]:
# Now we can do ordinal comparisons! Look at the bad example first: the original
# dataframe has no category dtype, so the grades are compared as plain strings
df[df["Grades"]>"C"]

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [6]:
# Now how does that look with our ordered category dtype?
grades[grades>"C"]

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
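* Comparisons aren't the only thing an ordering buys us. As a small sketch (the `sample` Series and the name `grade_dtype` are made up for illustration), an ordered categorical also supports `min()`, `max()` and sorting by grade rather than by spelling

```python
import pandas as pd

grade_order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
grade_dtype = pd.CategoricalDtype(categories=grade_order, ordered=True)

# A handful of made-up grades, just for illustration
sample = pd.Series(['B', 'A-', 'C+', 'D', 'A+']).astype(grade_dtype)

# Lowest and highest grade according to our ordering
print(sample.min(), sample.max())

# Sorting follows the category order, not alphabetical order
print(sample.sort_values().tolist())

# Boolean mask of grades that are at least a B
print((sample >= 'B').tolist())
```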

* Great! So we can encapsulate a limited set of values (categories), plus an ordering if appropriate (through our own dtype), in pandas, and that lets us do operations we otherwise couldn't do
* Now, it turns out we use this in machine learning and data mining a fair bit. Some techniques (regression) are used to predict continuous values, while others (classification) are used to predict categories
* So how do we change from continuous data to categorical data in pandas? I'm glad you asked!

In [7]:
# Let's look at that census data
import numpy as np
df=pd.read_csv("datasets/census.csv")
df=df[df['SUMLEV']==50]
df=df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [8]:
# Now if we just want to make "bins" of each of these, we can use cut()
# this takes the Series (our df is a Series after the groupby) and the number of
# bins, and returns a new categorical Series
df=pd.cut(df,10)
df.head()

STNAME
Alabama         (11706.087, 75333.413]
Alaska          (11706.087, 75333.413]
Arizona       (390320.176, 453317.529]
Arkansas        (11706.087, 75333.413]
California    (579312.234, 642309.586]
Name: CENSUS2010POP, dtype: category
Categories (10, interval[float64]): [(11706.087, 75333.413] < (75333.413, 138330.766] < (138330.766, 201328.118] < (201328.118, 264325.471] ... (390320.176, 453317.529] < (453317.529, 516314.881] < (516314.881, 579312.234] < (579312.234, 642309.586]]

* Notice the notation is mathematical (open/closed intervals)
* See how Alabama and Alaska are now in the same category, but Arizona is in another category
* Notice that pandas ordered all of these now too
* What happens if we want to add a new value into the mix?

In [9]:
df.loc["Canada"]=50000
df.tail()

STNAME
Washington       (138330.766, 201328.118]
West Virginia      (11706.087, 75333.413]
Wisconsin         (75333.413, 138330.766]
Wyoming            (11706.087, 75333.413]
Canada                              50000
Name: CENSUS2010POP, dtype: object
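
* Notice the dtype fell back to object: 50000 is a plain number, not one of our interval categories
* One possible way around this (a sketch, not the only approach) is to keep the raw numbers and bin any new value with the same edges. The bin edges and labels below are made up for illustration, and the populations are rounded from the earlier output

```python
import pandas as pd

# Average county populations for a few states, rounded from the output above
pops = pd.Series([71339.3, 24490.7, 426134.5, 38878.9, 642309.6],
                 index=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'])

# Hypothetical bin edges and labels, chosen only for illustration
edges = [0, 100_000, 500_000, 1_000_000]
labels = ['small', 'medium', 'large']

# Passing explicit edges (instead of a bin count) keeps the bins fixed
binned = pd.cut(pops, bins=edges, labels=labels)
print(binned)

# Bin a new number with the same edges so it maps onto an existing category
new_bin = pd.cut([50000], bins=edges, labels=labels)[0]
print(new_bin)
```

* Because the edges are fixed up front, new values land in the same categories as the old ones instead of forcing pandas to give up on the categorical dtype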