## Pandas Introduction 


In [14]:
## import the library 
import pandas as pd 

In [15]:
## Read the data into pandas with the read_csv method 

fight_songs = pd.read_csv('../data/fight_song.csv')

In [20]:
fight_songs.head(5) 

Unnamed: 0,school,conference,song_name,writers,year,student_writer,official_song,contest,bpm,sec_duration,fight,number_fights,victory,win_won,victory_win_won,rah,nonsense,colors,men,opponents,spelling,trope_count,spotify_id
0,Notre Dame,Independent,Victory March,Michael J. Shea and John F. Shea,1908,No,Yes,No,152,64,Yes,1,Yes,Yes,Yes,Yes,No,Yes,Yes,No,No,6,15a3ShKX3XWKzq0lSS48yr
1,Baylor,Big 12,Old Fight,Dick Baker and Frank Boggs,1947,Yes,Yes,No,76,99,Yes,4,Yes,Yes,Yes,No,No,Yes,No,No,Yes,5,2ZsaI0Cu4nz8DHfBkPt0Dl
2,Iowa State,Big 12,Iowa State Fights,"Jack Barker, Manly Rice, Paul Gnam, Rosalind K...",1930,Yes,Yes,No,155,55,Yes,5,No,No,No,Yes,No,No,Yes,No,Yes,4,3yyfoOXZQCtR6pfRJqu9pl
3,Kansas,Big 12,I'm a Jayhawk,"George ""Dumpy"" Bowles",1912,Yes,Yes,No,137,62,No,0,No,No,No,No,Yes,No,Yes,Yes,No,3,0JzbjZgcjugS0dmPjF9R89
4,Kansas State,Big 12,Wildcat Victory,Harry E. Erickson,1927,Yes,Yes,No,80,67,Yes,6,Yes,No,Yes,No,No,Yes,No,No,No,3,4xxDK4g1OHhZ44sTFy8Ktm


In [17]:
## change the number of columns visible in the notebook 
pd.set_option('display.max_columns', 50)

## Attributes and Methods

Pandas offers a huge number of built-in attributes and methods that we will use to "munge" data. Type a dataframe's name followed by a period, then press Tab to pull them up.

In [None]:
## scroll through some available attributes (press Tab after the period)
fight_songs.

## Initial exploratory methods 

Note: `df` refers to any generic dataframe. Remember that some methods work on entire dataframes and others work on series (individual columns).

``` python 
## show first rows 
df.head()

## show last rows 
df.tail()

## show general info 
df.info()

## show rows, columns
df.shape

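## numerical descriptions (there are many more than these)
## numeric_only=True skips text columns, which recent pandas versions refuse to average
df.mean(numeric_only=True)
df.std(numeric_only=True)
df.describe()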
```
#### What is the mean length of all fight songs? 
#### How long is the longest fight song?

In [24]:
fight_songs.head(2)

Unnamed: 0,school,conference,song_name,writers,year,student_writer,official_song,contest,bpm,sec_duration,fight,number_fights,victory,win_won,victory_win_won,rah,nonsense,colors,men,opponents,spelling,trope_count,spotify_id
0,Notre Dame,Independent,Victory March,Michael J. Shea and John F. Shea,1908,No,Yes,No,152,64,Yes,1,Yes,Yes,Yes,Yes,No,Yes,Yes,No,No,6,15a3ShKX3XWKzq0lSS48yr
1,Baylor,Big 12,Old Fight,Dick Baker and Frank Boggs,1947,Yes,Yes,No,76,99,Yes,4,Yes,Yes,Yes,No,No,Yes,No,No,Yes,5,2ZsaI0Cu4nz8DHfBkPt0Dl


In [23]:
fight_songs.describe()

Unnamed: 0,bpm,sec_duration,number_fights,trope_count
count,65.0,65.0,65.0,65.0
mean,128.8,71.907692,2.846154,3.615385
std,33.152677,25.056014,3.231828,1.674182
min,65.0,27.0,0.0,0.0
25%,90.0,58.0,0.0,3.0
50%,140.0,67.0,2.0,4.0
75%,151.0,85.0,5.0,5.0
max,180.0,172.0,17.0,8.0
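
The two questions above can be answered by reading the `mean` and `max` rows of the `describe()` output, or by calling the methods on the `sec_duration` series directly. Here is a minimal sketch of the second approach. The `sample` dataframe is a hand-built stand-in containing only the five rows shown by `head()` earlier, so its numbers differ from the full file.

```python
import pandas as pd

# Stand-in data: the five rows displayed by head() earlier, not the full 65-row file
sample = pd.DataFrame({
    'school': ['Notre Dame', 'Baylor', 'Iowa State', 'Kansas', 'Kansas State'],
    'sec_duration': [64, 99, 55, 62, 67],
})

mean_duration = sample['sec_duration'].mean()   # average length in seconds
longest = sample['sec_duration'].max()          # length of the longest song in seconds

# idxmax() returns the index label of the row holding the maximum value
longest_school = sample.loc[sample['sec_duration'].idxmax(), 'school']

print(mean_duration, longest, longest_school)
```

On the full dataframe, the same calls on `fight_songs['sec_duration']` should agree with the `describe()` output above: a mean of about 71.9 seconds and a maximum of 172 seconds.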


## Series

A series is an individual column. It is a different data type from a dataframe and has its own attributes and methods.

Access a series using square brackets:

``` python
df['column_name']
```
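
To see the difference in type, here is a small sketch on a hand-built dataframe (`sample` is a stand-in, not the fight song file). Single brackets with one column name return a series, while a list of column names returns a dataframe, even when the list holds a single name.

```python
import pandas as pd

# Stand-in data for illustration only
sample = pd.DataFrame({
    'school': ['Notre Dame', 'Baylor', 'Iowa State'],
    'conference': ['Independent', 'Big 12', 'Big 12'],
})

one_column = sample['conference']     # single brackets, one name -> Series
sub_frame = sample[['conference']]    # list of names -> DataFrame with one column

print(type(one_column).__name__)
print(type(sub_frame).__name__)
```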

---
Take a look at this documentation:
[Value Counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

Note the full name at the top of that page:
<div style="color:blue"> pandas.Series.value_counts </div>

This means that this `value_counts` is called on a series, as opposed to [Drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), which is called on a dataframe:

<div style="color:blue" > pandas.DataFrame.drop </div>

In [32]:
## use value counts on conference 
fight_songs['conference'].value_counts()

ACC            14
Big Ten        14
SEC            14
Pac-12         12
Big 12         10
Independent     1
Name: conference, dtype: int64
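
The series-versus-dataframe distinction from the documentation links can be sketched side by side. The example below uses a small stand-in dataframe (`sample` and its values are for illustration only): `value_counts` is called on one column, while `drop` is called on the whole table.

```python
import pandas as pd

# Stand-in data for illustration only
sample = pd.DataFrame({
    'school': ['Notre Dame', 'Baylor', 'Iowa State'],
    'conference': ['Independent', 'Big 12', 'Big 12'],
    'bpm': [152, 76, 155],
})

# Series method: called on a single column
counts = sample['conference'].value_counts()

# DataFrame method: called on the whole table; returns a new dataframe
# and leaves the original unchanged
without_bpm = sample.drop(columns=['bpm'])

print(counts)
print(without_bpm.columns.tolist())
```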

In [33]:
## creating a variable of one series 
column_conference = fight_songs['conference']

In [40]:
column_conference.value_counts()

ACC            14
Big Ten        14
SEC            14
Pac-12         12
Big 12         10
Independent     1
Name: conference, dtype: int64

In [42]:
## with normalized values 
column_conference.value_counts(normalize=True)

ACC            0.215385
Big Ten        0.215385
SEC            0.215385
Pac-12         0.184615
Big 12         0.153846
Independent    0.015385
Name: conference, dtype: float64

In [36]:
fight_songs['conference'].unique()

array(['Independent', 'Big 12', 'Big Ten', 'Pac-12', 'SEC', 'ACC'],
      dtype=object)

In [37]:
## number of unique values 
fight_songs['conference'].nunique()

6

## Pandas Broadcasting 

Broadcasting applies the same operation to every cell in a column at once.

In [None]:
## We'll create a new column called "years_old" 

In [52]:
## defining a small function that we'll apply to every cell in the "year" column

def munger(cell):
    if cell == 'Unknown':
        return 2019
    else:
        return int(cell)

### 
fight_songs['year'] = fight_songs['year'].map(munger)

In [55]:
## new column with broadcasting 

fight_songs['years_old'] = 2019 - fight_songs['year']
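
The two steps above work differently: `.map(munger)` calls a Python function once per cell, while `2019 - fight_songs['year']` broadcasts a single number across the whole series. The sketch below shows both on a tiny stand-in column, assuming (as `munger` implies) that the years are stored as text. It also shows `pd.to_numeric` as one possible vectorized alternative to `munger`; that is a swapped-in technique, not the notebook's method.

```python
import pandas as pd

# Hypothetical stand-in for the "year" column: years stored as text, one unknown
sample = pd.DataFrame({'year': ['1908', '1947', 'Unknown']})

def munger(cell):
    """Treat an unknown year as 2019, otherwise convert the text to an integer."""
    if cell == 'Unknown':
        return 2019
    return int(cell)

# Element-wise: map calls munger once per cell
year_via_map = sample['year'].map(munger)

# Vectorized alternative: non-numeric text becomes NaN, then is filled with 2019
year_via_numeric = pd.to_numeric(sample['year'], errors='coerce').fillna(2019).astype(int)

# Broadcasting: the scalar 2019 is paired with every row of the series
years_old = 2019 - year_via_map
print(years_old.tolist())
```

Because unknown years are replaced with 2019, those songs end up with a `years_old` value of 0, which is worth keeping in mind when summarizing the new column.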

## A Taste of what's to come - Boolean Filtering, Groupby, Map

Don't worry about following along here. I just want to introduce these concepts and provide an example of each.

In [58]:
## boolean filter 

is_short = fight_songs['sec_duration'] < 61

In [62]:
## grabbing short songs 
## pass the boolean filter to .loc; .head() then returns just the first 5 rows
fight_songs.loc[is_short, :].head()

Unnamed: 0,school,conference,song_name,writers,year,student_writer,official_song,contest,bpm,sec_duration,fight,number_fights,victory,win_won,victory_win_won,rah,nonsense,colors,men,opponents,spelling,trope_count,spotify_id,years_old
2,Iowa State,Big 12,Iowa State Fights,"Jack Barker, Manly Rice, Paul Gnam, Rosalind K...",1930,Yes,Yes,No,155,55,Yes,5,No,No,No,Yes,No,No,Yes,No,Yes,4,3yyfoOXZQCtR6pfRJqu9pl,89
5,Oklahoma,Big 12,Boomer Sooner,Arthur M. Alden,1905,Yes,Yes,No,153,37,No,0,No,No,No,Yes,No,No,No,No,Yes,2,0QXC8Gg1oKWkORegslTXoT,114
6,Oklahoma State,Big 12,Ride 'Em Cowboys,J.K. Long,1934,No,Yes,No,180,29,Yes,5,Yes,No,Yes,No,No,No,Yes,Yes,No,4,0mTJqaacUZPG740Y1YDn5j,85
8,TCU,Big 12,TCU Fight Song,Claude Sammis,1928,No,Yes,No,149,47,Yes,2,Yes,No,Yes,Yes,No,Yes,Yes,No,Yes,6,0ItcRLvqHlbkaqMCPtQKUl,91
9,Texas Tech,Big 12,"Fight, Raiders, Fight",Carroll McMath,1936,Yes,Yes,No,159,54,Yes,8,Yes,No,Yes,No,No,Yes,No,No,No,3,3DfKi9Iqvtxf4DIjs8ezTq,83
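
A boolean filter is itself a series holding one `True` or `False` per row, and `.loc` keeps only the rows marked `True`. The sketch below makes that visible on a small stand-in dataframe (the five rows from the first `head()` output). The combined condition at the end is an added illustration, not part of the notebook.

```python
import pandas as pd

# Stand-in data: the five rows displayed by head() earlier
sample = pd.DataFrame({
    'school': ['Notre Dame', 'Baylor', 'Iowa State', 'Kansas', 'Kansas State'],
    'sec_duration': [64, 99, 55, 62, 67],
    'bpm': [152, 76, 155, 137, 80],
})

# One True/False value per row
is_short = sample['sec_duration'] < 61
print(is_short.tolist())

# Rows where the filter is True, all columns
short_songs = sample.loc[is_short, :]

# Conditions combine with & (and) or | (or); each one needs its own parentheses
fast_and_long = sample.loc[(sample['bpm'] > 140) & (sample['sec_duration'] > 60), 'school']
print(fast_and_long.tolist())
```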


In [67]:
### using groupby 

fight_songs.groupby('conference')['sec_duration'].mean()

conference
ACC            73.571429
Big 12         60.700000
Big Ten        82.857143
Independent    64.000000
Pac-12         71.500000
SEC            68.214286
Name: sec_duration, dtype: float64
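
`groupby` splits the rows by the values in one column, applies a calculation to each group, and combines the results into one series or dataframe. Here is a minimal sketch on stand-in data (five rows only, so the numbers differ from the output above). The `.agg` call is an added illustration of computing several statistics per group in one pass.

```python
import pandas as pd

# Stand-in data: the five rows displayed by head() earlier
sample = pd.DataFrame({
    'conference': ['Independent', 'Big 12', 'Big 12', 'Big 12', 'Big 12'],
    'sec_duration': [64, 99, 55, 62, 67],
})

# One mean per conference, returned as a series indexed by conference
by_conf = sample.groupby('conference')['sec_duration'].mean()

# Several statistics per conference, returned as a dataframe
summary = sample.groupby('conference')['sec_duration'].agg(['count', 'mean', 'max'])

print(by_conf)
print(summary)
```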

In [75]:
### define function
def is_state(s):
    if 'State' in s:
        return 1
    else:
        return 0



In [73]:
## the function working on one individual input 
is_state('Notre Dame')

0

In [None]:
## make new column with function 
fight_songs['State School'] = fight_songs['school'].map(is_state)

In [77]:
fight_songs.head()

Unnamed: 0,school,conference,song_name,writers,year,student_writer,official_song,contest,bpm,sec_duration,fight,number_fights,victory,win_won,victory_win_won,rah,nonsense,colors,men,opponents,spelling,trope_count,spotify_id,years_old,State School
0,Notre Dame,Independent,Victory March,Michael J. Shea and John F. Shea,1908,No,Yes,No,152,64,Yes,1,Yes,Yes,Yes,Yes,No,Yes,Yes,No,No,6,15a3ShKX3XWKzq0lSS48yr,111,0
1,Baylor,Big 12,Old Fight,Dick Baker and Frank Boggs,1947,Yes,Yes,No,76,99,Yes,4,Yes,Yes,Yes,No,No,Yes,No,No,Yes,5,2ZsaI0Cu4nz8DHfBkPt0Dl,72,0
2,Iowa State,Big 12,Iowa State Fights,"Jack Barker, Manly Rice, Paul Gnam, Rosalind K...",1930,Yes,Yes,No,155,55,Yes,5,No,No,No,Yes,No,No,Yes,No,Yes,4,3yyfoOXZQCtR6pfRJqu9pl,89,1
3,Kansas,Big 12,I'm a Jayhawk,"George ""Dumpy"" Bowles",1912,Yes,Yes,No,137,62,No,0,No,No,No,No,Yes,No,Yes,Yes,No,3,0JzbjZgcjugS0dmPjF9R89,107,0
4,Kansas State,Big 12,Wildcat Victory,Harry E. Erickson,1927,Yes,Yes,No,80,67,Yes,6,Yes,No,Yes,No,No,Yes,No,No,No,3,4xxDK4g1OHhZ44sTFy8Ktm,92,1
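
For simple text checks like this one, pandas string methods offer a vectorized alternative to `.map` with a custom function. The sketch below compares the two on a stand-in column; `.str.contains` is a swapped-in technique shown for comparison, not the notebook's method.

```python
import pandas as pd

# Stand-in data: the schools displayed by head() above
sample = pd.DataFrame({
    'school': ['Notre Dame', 'Baylor', 'Iowa State', 'Kansas', 'Kansas State'],
})

def is_state(s):
    """Return 1 if the word 'State' appears in the name, otherwise 0."""
    return 1 if 'State' in s else 0

# Element-wise: one function call per cell
via_map = sample['school'].map(is_state)

# Vectorized: regex=False treats 'State' as plain text; astype(int) turns True/False into 1/0
via_str = sample['school'].str.contains('State', regex=False).astype(int)

print(via_map.tolist())
print(via_str.tolist())
```

Either way, the flag only marks names that contain the word "State"; a public university without "State" in its name, such as Kansas, is still marked 0.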
