# Scales
* We're going to talk about things you probably learned in grade school but also probably don't think about much
* And of course, we're going to talk about them in Pandas!

In [1]:
# Let's look at some letter grades...
import pandas as pd
df=pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 
                       'ok', 'ok', 'ok', 'poor', 'poor'],
               columns=["Grades"])
df

           Grades
excellent      A+
excellent       A
excellent      A-
good           B+
good            B
good           B-
ok             C+
ok              C
ok             C-
poor           D+
poor            D


In [2]:
# What is our series datatype?
df.dtypes

Grades    object
dtype: object

* That seems pretty broad, eh? "object" pretty much means anything...
* We know more here. We have clear categories that have meaning to us. We can put this meaning into pandas `DataFrame` objects

In [3]:
# We can use the astype() method to tell pandas to treat this column as categorical
df['Grades'].astype('category').head()  # returns a new Series with dtype category; df itself is unchanged

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

* Notice that there are now 11 categories!
* But actually, our data isn't just categorical, is it? What else do we know about this data?
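
To see what the plain `category` dtype does and does not know, here is a minimal sketch. The sample Series `s` is made up for illustration and is not part of the original notebook; it inspects the `.cat` accessor to show that the categories are only sorted as strings and carry no ranking.

```python
import pandas as pd

# Hypothetical sample data; any Series of strings behaves the same way
s = pd.Series(['B', 'A+', 'C-', 'A', 'B'])
as_cat = s.astype('category')

# Distinct values, sorted as plain strings (so 'A' comes before 'A+')
print(as_cat.cat.categories.tolist())
# False: pandas has not been told that one grade outranks another
print(as_cat.cat.ordered)
# Each row is stored as an integer code pointing into the categories
print(as_cat.cat.codes.tolist())
```

Because `ordered` is `False`, comparisons such as `as_cat > 'B'` are not meaningful yet; that is what the custom dtype in the next cell addresses.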

In [4]:
# We can tell pandas that the data is ordered by first creating our own data type
my_categories = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)  # categories listed from lowest to highest
# then we just pass this to the astype() method
grades = df['Grades'].astype(my_categories)  # cast using the ordered dtype
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
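
Once the dtype is ordered, operations that depend on rank become available. The sketch below rebuilds the same ordered dtype on a small made-up Series (the names `grade_dtype` and `sample` are illustrative, not from the notebook) and shows `min()`, `max()`, and `sort_values()` following the grade order rather than alphabetical order.

```python
import pandas as pd

# Same category order as above, lowest grade first
grade_dtype = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)

# Hypothetical sample of grades
sample = pd.Series(['B+', 'A', 'C-', 'A+', 'D'], dtype=grade_dtype)

# min/max respect the declared order, not string order
print(sample.min(), sample.max())
# Sorting goes from the lowest grade to the highest
print(sample.sort_values().tolist())
```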

In [5]:
# Now we can do ordinal comparisons! Look at the bad example first: the original
# DataFrame, where Grades is still a column of plain strings
df[df['Grades'] > 'C']  # strings compare lexicographically, so this is not "better than a C"

      Grades
ok        C+
ok        C-
poor      D+
poor       D


In [7]:
# Now how does that look in a category-aware sense?
grades[grades > 'C']  # uses the ordered categorical, so this returns grades ranked above C

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

* Great! So we can capture a limited set of allowed values (categories), plus an ordering where appropriate (through our own dtype), in pandas, and that lets us do operations we otherwise couldn't
* Now, it turns out we use this in machine learning and data mining a fair bit. Some techniques (regression) are used to predict continuous values, while others (classification) are used to predict categories
* So how do we change from continuous data to categorical data in pandas? I'm glad you asked!

In [7]:
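# Let's look at that census data
import numpy as np
df=pd.read_csv("datasets/census.csv")
# Note: this county-level filter (SUMLEV == 50) is stored in result, but the next line
# starts again from df and overwrites it, so the averages below are taken over every row
result=df[df['SUMLEV']==50]
result=df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
result.head()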

STNAME
Alabama       1.405805e+05
Alaska        4.734873e+04
Arizona       7.990021e+05
Arkansas      7.673468e+04
California    1.262846e+06
Name: CENSUS2010POP, dtype: float64

In [8]:
# Now if we just want to make "bins" of each of these, we can use cut()
# here it takes the Series and a list of bin edges, and returns a new categorical Series
# (stored under a new name so that result stays numeric for the next cell)
binned = pd.cut(result, [1000, 10000, 100000, 1000000, 10000000])  # binning values into discrete intervals
binned.head()

STNAME
Alabama         (100000, 1000000]
Alaska            (10000, 100000]
Arizona         (100000, 1000000]
Arkansas          (10000, 100000]
California    (1000000, 10000000]
Name: CENSUS2010POP, dtype: category
Categories (4, interval[int64]): [(1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
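
If interval names are hard to read, `pd.cut()` also accepts a `labels` argument. The following is a small sketch, assuming three populations rounded from the output above and four made-up label names (`tiny`, `small`, `medium`, `large`); neither the Series `pops` nor the labels appear in the original notebook.

```python
import pandas as pd

# Rounded from the averages printed earlier; illustrative only
pops = pd.Series([4.7e4, 1.4e5, 1.26e6],
                 index=['Alaska', 'Alabama', 'California'])

edges = [1000, 10000, 100000, 1000000, 10000000]
# One label per bin, so there is one fewer label than there are edges
sizes = pd.cut(pops, edges, labels=['tiny', 'small', 'medium', 'large'])

print(sizes.tolist())
# The labelled result is still an ordered categorical
print(sizes.cat.ordered)
```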

In [12]:
result = pd.cut(result, 10)  # passing an integer instead creates that many equal-width bins
result.head()

STNAME
Alabama          (23065.339, 148158.089]
Alaska           (23065.339, 148158.089]
Arizona          (767429.132, 891283.34]
Arkansas         (23065.339, 148158.089]
California    (1138991.758, 1262845.966]
Name: CENSUS2010POP, dtype: category
Categories (10, interval[float64]): [(23065.339, 148158.089] < (148158.089, 272012.298] < (272012.298, 395866.506] < (395866.506, 519720.715] ... (767429.132, 891283.34] < (891283.34, 1015137.549] < (1015137.549, 1138991.758] < (1138991.758, 1262845.966]]

* Notice the notation is mathematical: `(a, b]` is open on the left (excludes `a`) and closed on the right (includes `b`)
* See how Alabama and Alaska are now in the same category, but Arizona is in another category
* Notice that pandas ordered all of these now too
* More on categories: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
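
The open/closed distinction matters for values that sit exactly on a bin edge. As a minimal sketch on made-up numbers (the Series `values` is hypothetical), the default `right=True` puts an edge value in the bin that ends there, while `right=False` puts it in the bin that starts there.

```python
import pandas as pd

# Hypothetical values that fall exactly on the bin edges
values = pd.Series([10, 20, 30])
edges = [0, 10, 20, 30]

right_closed = pd.cut(values, edges)              # bins look like (0, 10]
left_closed = pd.cut(values, edges, right=False)  # bins look like [0, 10)

# With right-closed bins, 10 lands in the bin that ends at 10
print(right_closed.tolist())
# With left-closed bins, 10 lands in the bin that starts at 10,
# and 30 falls outside every bin, so it becomes NaN
print(left_closed.tolist())
```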

In [13]:
# The ten bins are ordered, so we can relabel them from lowest to highest
result.cat.rename_categories(['F', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'])

STNAME
Alabama                  F
Alaska                   F
Arizona                 B+
Arkansas                 F
California              A+
Colorado                C-
Connecticut             B+
Delaware                C+
District of Columbia    B-
Florida                 B-
Georgia                  F
Hawaii                  C+
Idaho                    F
Illinois                C-
Indiana                  F
Iowa                     F
Kansas                   F
Kentucky                 F
Louisiana                F
Maine                   C-
Maryland                C+
Massachusetts           B+
Michigan                C-
Minnesota                F
Mississippi              F
Missouri                 F
Montana                  F
Nebraska                 F
Nevada                   C
New Hampshire           C-
New Jersey              B+
New Mexico               F
New York                B-
North Carolina          C-
North Dakota             F
Ohio                    C-
Oklahoma             