# MSAS Python Tutorial Three

**Intro to dataframe analysis**

We will begin this tutorial by getting everyone set up to explore data analysis in Python.

Here is a list of libraries that are useful to have installed. We will discuss how to install them:

* pandas
* numpy
* regex
* SciPy
* matplotlib
* scikit-learn

There are many more useful libraries in the world of data science, but these are a good start because there is a decent chance I will use them in the coming tutorials.

**The following code snippet is a reliable way to install a library from inside a notebook**

In [None]:
import sys
# Install into the same environment that this notebook's kernel is running in
!conda install --yes --prefix {sys.prefix} scikit-learn
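
Before installing anything, it can help to see which of the libraries listed above are already available. The snippet below is a minimal sketch using only the standard library. Note that the import name can differ from the install name (scikit-learn is imported as `sklearn`, SciPy as `scipy`).

```python
import importlib.util

# Import names for the libraries listed above
# (scikit-learn is imported as sklearn, SciPy as scipy).
modules = ["pandas", "numpy", "regex", "scipy", "matplotlib", "sklearn"]

# find_spec returns None when a top-level module cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in modules}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with the conda cell above.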

**If Python says that conda is not updated, I have heard this code snippet is helpful.**

**However, do not run this now. I am pretty sure it takes a long time to run!**

In [None]:
#conda install -n base -c defaults conda=23.7.4

Let's start by importing the pandas library:

In [None]:
import pandas as pd

The most fundamental aspect of the pandas library is that it allows us to read in a CSV (Comma-Separated Values) file, which is a form of tabular data storage.

In [None]:
df = pd.read_csv('statsBomb Michigan 2022 Plays.csv')

In [None]:
df.head()

The head method is very useful, as it allows us to take a quick glance at the first few rows of the data to ensure that it looks the way we want it to.
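
If you do not have the CSV file handy, here is a small sketch of the same read-then-peek workflow on text held in memory. The rows and the `yards` column are made up for illustration and are not taken from the StatsBomb file.

```python
import io
import pandas as pd

# Made-up rows for illustration; not taken from the StatsBomb file.
csv_text = """offense_team_name,defense_team_name,yards
Team A,Team B,5
Team A,Team B,12
Team B,Team A,-2
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 if n is left out)
first_two = toy.head(2)
print(first_two)
```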

**Now let's take a look at a few important dataframe attributes**

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes
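
Notice that these three are attributes, not methods, so they are written without parentheses. Here is a quick sketch on a tiny made-up dataframe (the column names are hypothetical) showing what each one reports.

```python
import pandas as pd

# Tiny made-up dataframe; column names are hypothetical
toy = pd.DataFrame({
    "team": ["A", "B", "C"],
    "yards": [5, 12, -2],
    "gain": [1.5, 3.0, 0.25],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
col_names = list(toy.columns)  # columns holds the column labels
col_types = toy.dtypes       # dtypes is a Series with one data type per column

print(n_rows, n_cols)
print(col_names)
print(col_types)
```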

Isolating a column, and distinguishing between DataFrames and Series:

In [None]:
col1 = df['offense_team_name']
print(col1)
display(type(col1))
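
The difference comes down to the brackets. In this sketch on a made-up dataframe, single brackets with one column name give back a Series, while double brackets (a list of column names) give back a DataFrame, even when the list has only one name in it.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({"team": ["A", "B", "C"], "yards": [5, 12, -2]})

as_series = toy["team"]    # single brackets -> Series (one column of values)
as_frame = toy[["team"]]   # double brackets -> DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```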

**Slicing and subsetting a dataframe:**

* This is an essential skill when analyzing dataframes in Python

What if I were only interested in JJ's pass placement displacement from the Ohio State game?

In [None]:
# Keep every row (:) but only these two columns. Note that this overwrites df.
df = df.loc[:, ['defense_team_name', 'play_pass_placement_displacement']]
df.head()

In [None]:
# Boolean mask: keep only the rows where the opposing defense is Ohio State
df = df[df['defense_team_name'] == 'Ohio State Buckeyes']
df.head()

How do you guys think I could eliminate the null values from the dataframe?

In [None]:
df = df[df['play_pass_placement_displacement'].notna()]
df.head()
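
The `notna()` mask above is one way to do it. pandas also has a `dropna` method that does the same job, and the sketch below checks on a made-up dataframe (hypothetical column names) that the two approaches give the same result.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value
toy = pd.DataFrame({
    "opponent": ["X", "X", "Y"],
    "displacement": [1.2, np.nan, 0.4],
})

# Option 1: boolean mask, as in the cell above
masked = toy[toy["displacement"].notna()]

# Option 2: dropna, limited to the column we care about
dropped = toy.dropna(subset=["displacement"])

print(masked.equals(dropped))
```

Passing `subset` matters: without it, `dropna` removes a row if any of its columns is null.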

Try this subsetting problem on your own. Let's say that you only want to investigate plays when Michigan is running a two-minute offense. How could you do that?

* Hint: Look at the CSV directory to observe which columns you will need to use in your boolean mask
* Hint: 120 seconds is equal to 120,000 milliseconds
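
A problem like this needs more than one condition in the mask. Here is a sketch of the general pattern on a made-up dataframe. The column names `team` and `ms_remaining` are hypothetical, so you will need to find the real ones in the data. Each condition goes in its own parentheses, and they are combined with `&` (and) or `|` (or) rather than the Python keywords `and` and `or`.

```python
import pandas as pd

# Made-up data; "team" and "ms_remaining" are hypothetical column names
toy = pd.DataFrame({
    "team": ["A", "A", "B", "A"],
    "ms_remaining": [300000, 90000, 60000, 119999],
})

# Wrap each condition in parentheses, then combine them with &
mask = (toy["team"] == "A") & (toy["ms_remaining"] <= 120_000)
late = toy[mask]

print(late)
```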

**Lastly, let's look at a few useful methods that pandas has to offer:**

In [None]:
df.describe()

In [None]:
df.sort_values(by='play_pass_placement_displacement', ascending=False, inplace=True)

In [None]:
df.set_index('defense_team_name', inplace=True)

In [None]:
df.head()
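
The cells above use `inplace=True`, which changes `df` itself. Another option is to leave the original alone and assign the result to a new name, which also lets you chain the two methods together. This is a sketch on made-up data with hypothetical column names.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "opponent": ["X", "Y", "Z"],
    "displacement": [0.4, 1.2, 0.9],
})

# Without inplace=True, each method returns a new dataframe,
# so the calls can be chained and toy is left unchanged.
ranked = toy.sort_values(by="displacement", ascending=False).set_index("opponent")

print(ranked)
```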

**I hope you guys enjoyed this intro to dataframe analysis in Python! Definitely try some of this stuff on your own! Remember, we just scratched the surface with attributes, methods, and boolean masking in pandas, so if you are wondering if it's possible to view or manipulate your dataframe in a certain way, just look it up!**