#### You can create a categorical variable from a quantitative variable by defining your own categories. pandas' cut function lets you "cut" data into groups. Using it, I am going to create a new column called so_levels with these categories:

#### Strikeout Levels:
Low: Lowest 25% of SO values
Moderately Low: 25% - 50% of SO values
Moderately High: 50% - 75% of SO values
High: 75% - max SO value
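
As a quick illustration of how `pd.cut` maps numbers to labelled groups, here is a minimal sketch. The strikeout values and bin edges are made up for the example and are not taken from pitching.csv.

```python
import pandas as pd

# Hypothetical strikeout totals (illustrative only)
so = pd.Series([3, 12, 25, 40, 90])

# Illustrative edges; by default each bin is (left, right],
# so 10 would land in the first bin and 30 in the second
edges = [0, 10, 30, 60, 100]
names = ['low', 'mod_low', 'mod_high', 'high']

# Four bins need five edges and four labels
levels = pd.cut(so, bins=edges, labels=names)
print(levels.tolist())
```

The result is an ordered categorical, so the levels keep their low-to-high order when grouped or sorted.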

In [1]:
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
dfp = pd.read_csv('pitching.csv')
dfp.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
0,bechtge01,1871,1,PH1,,1,2,3,3,2,...,,7,,0,146.0,0,42,,,
1,brainas01,1871,1,WS3,,12,15,30,30,30,...,,7,,0,1291.0,0,292,,,
2,fergubo01,1871,1,NY2,,0,0,1,0,0,...,,2,,0,14.0,0,9,,,
3,fishech01,1871,1,RC1,,4,16,24,24,22,...,,20,,0,1080.0,1,257,,,
4,fleetfr01,1871,1,NY2,,0,1,1,1,1,...,,0,,0,57.0,0,21,,,


In [3]:
## View the min, 25%, 50%, 75%, max Strikeout values with Pandas describe
dfp.describe().SO

count    49430.000000
mean        45.988509
std         49.164188
min          0.000000
25%          8.000000
50%         30.000000
75%         67.000000
max        513.000000
Name: SO, dtype: float64

In [4]:
# Bin edges that will be used to "cut" the data into groups:
# the min, 25%, 50%, 75% and max SO values from describe() above
bin_edges = [0, 8, 30, 67, 513]
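
Rather than copying the five values by hand, the edges could also be computed directly. The sketch below uses `Series.quantile` on a small made-up series; on the real data the same call would be applied to `dfp['SO']`.

```python
import pandas as pd

# Hypothetical values standing in for dfp['SO']
so = pd.Series([0, 2, 4, 6, 8, 10, 12, 14, 16])

# min, 25%, 50%, 75% and max in a single call
bin_edges = so.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
print(bin_edges)
```

This keeps the edges in sync with the data if the file changes, at the cost of hiding the actual numbers from the reader.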

In [5]:
# Labels for the 4 groups 
bin_names = ['low', 'mod_low', 'mod_high', 'high'] 

In [6]:
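# Creates so_levels column
# Note: bins are open on the left by default, so rows with SO == 0
# fall outside the first bin and get NaN (shown as blank below)
dfp['so_levels'] = pd.cut(dfp['SO'], bin_edges, labels=bin_names)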

# Checks for successful creation of this column
dfp.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP,so_levels
0,bechtge01,1871,1,PH1,,1,2,3,3,2,...,7,,0,146.0,0,42,,,,low
1,brainas01,1871,1,WS3,,12,15,30,30,30,...,7,,0,1291.0,0,292,,,,mod_low
2,fergubo01,1871,1,NY2,,0,0,1,0,0,...,2,,0,14.0,0,9,,,,
3,fishech01,1871,1,RC1,,4,16,24,24,22,...,20,,0,1080.0,1,257,,,,mod_low
4,fleetfr01,1871,1,NY2,,0,1,1,1,1,...,0,,0,57.0,0,21,,,,
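
Rows 2 and 4 have an empty so_levels because their SO is 0 and the first bin, (0, 8], excludes its left edge. A minimal sketch of one way to include them, using `pd.cut`'s `include_lowest` parameter on made-up values:

```python
import pandas as pd

# Hypothetical values, including zeros
so = pd.Series([0, 1, 13, 0, 45])
bin_edges = [0, 8, 30, 67, 513]
bin_names = ['low', 'mod_low', 'mod_high', 'high']

# Default: zeros fall outside (0, 8] and become NaN
default_cut = pd.cut(so, bin_edges, labels=bin_names)

# include_lowest=True closes the first bin on the left
inclusive_cut = pd.cut(so, bin_edges, labels=bin_names, include_lowest=True)

print(default_cut.isna().sum())
print(inclusive_cut.tolist())
```

Applying this to the full dataset would move the zero-strikeout rows into the low group, so the group means shown in this notebook would change.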


In [7]:
# Find the mean Wins by strikeout level with groupby
# (select W before mean() so non-numeric columns are not averaged)
dfp.groupby('so_levels')['W'].mean()

so_levels
low          0.270873
mod_low      1.619163
mod_high     4.849558
high        11.407389
Name: W, dtype: float64

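#### The average number of wins increases with each strikeout level, from about 0.27 for low to about 11.4 for high. SO and W are both season totals, so part of this trend likely reflects how much each pitcher played.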
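
Before reading much into group means, it can help to see how many rows sit behind each one. The following sketch uses a tiny made-up frame; the column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for dfp
df = pd.DataFrame({
    'so_levels': ['low', 'low', 'high', 'high', 'high'],
    'W': [0, 1, 10, 12, 14],
})

# Row count and mean wins per level in one table
summary = df.groupby('so_levels')['W'].agg(['count', 'mean'])
print(summary)
```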