## Data Analysis and Visualization in Pandas and Matplotlib ##

This is our Data Visualization in Python Jupyter Notebook. We will learn to use the pandas and Matplotlib libraries to load our data and build visualizations from it.

The first thing we need to do is learn some background information about the libraries and technologies we will be using. We are all using the Anaconda software distribution today, which comes with a lot of functionality installed on top of the base Python libraries. This includes the pandas and matplotlib packages as well as the JupyterLab/Jupyter Notebook environment.

## JupyterLab/Jupyter Notebooks ##

[Project Jupyter Homepage](https://jupyter.org/)

The Jupyter environment is a web-based interactive computational environment for creating notebook-like documents. It supports several languages, including Python, R, and Julia. JupyterLab is the next-generation user interface, which includes Jupyter Notebooks.

In my opinion, they seem almost exactly the same but I'm sure people embedded within the development of the project would tell you differently.

Think of the Jupyter environment as an interactive blog post. As you'll see, Jupyter allows you to show your code and explain it in a very neat, easy to follow way. Each cell either contains text (like this one) or code. When writing code, each cell basically functions like the command line or console. And as you'll see, each cell is LIVE and you can change your code on the fly. 

Jupyter really excels in situations like this class where we will be walking through a topic step by step. I can explain things, you can play with the code, and it is easy for everyone to see. 

<strong>1.</strong> Let's get started by importing the libraries we will be using. I will explain these later as we go. Because we are all using Anaconda, all of these libraries are already installed.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt     #pyplot is matplotlib's MATLAB-style, state-based plotting interface
import matplotlib.ticker as ticker
import numpy as np


<strong> 2. </strong> First we need to read in some data so we can work with it. This is a CSV sheet of career stats for professional baseball player Mike Trout. Baseball is a numbers game, so this gives us a nice, easy-to-use dataset to work with.

In [3]:
df = pd.read_csv("MikeTroutData.csv")
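
If you do not have `MikeTroutData.csv` on hand, the following sketch shows the same `pd.read_csv` call reading from an in-memory string instead. The three rows are copied from the table printed in step 3; the `csv_text` and `sample` names are only for illustration.

```python
import io

import pandas as pd

# First three seasons, copied from the table shown in step 3
csv_text = """Year,Age,G,AB,R,H,HR,BA,Salary,Awards
2011,19,40,123,20,27,5,0.220,36000,0
2012,20,139,559,129,182,30,0.326,492500,4
2013,21,157,589,109,190,27,0.323,510000,3
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the file was parsed the way you expected.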

## Pandas ##

[Pandas Documentation](https://pandas.pydata.org/)

Pandas is an open source Python library providing high-performance, easy-to-use data structures and data analysis tools. We will be using pandas to work with our data before feeding it into Matplotlib. Pandas can read from and write to many different data formats. It handles missing or bad data gracefully. You can easily reshape or pivot your data. It is optimized for performance. And it has a massive international user community, so help and examples are readily available.

## Pandas Dataframes ##

The aforementioned easy-to-use data structure in pandas is called a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). This is a tabular or spreadsheet-like view of your data, just as you'd see it in Excel. A pandas dataframe is a 2-dimensional labeled data structure with columns and rows. It is the most commonly used pandas object. Each one-dimensional row or column is called a [pandas series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). Along with the data, you can pass index (row labels) and columns (column labels) as arguments when you build a dataframe yourself.
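
As a small illustration of the paragraph above, a dataframe can be built from a dictionary of columns. The values are the first three seasons of the dataset used in this notebook; the `small` name and the row labels used for the index are made up for this sketch.

```python
import pandas as pd

small = pd.DataFrame(
    {"Year": [2011, 2012, 2013], "H": [27, 182, 190]},
    index=["first", "second", "third"],   # custom row labels instead of 0, 1, 2
)

hits_column = small["H"]          # one column of the dataframe
print(type(small))
print(type(hits_column))
print(hits_column["second"])      # look up a value by its row label
```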

<strong> 3. </strong> Let's take a look at our data

In [4]:
print(df)

   Year  Age    G   AB    R    H  HR     BA    Salary  Awards
0  2011   19   40  123   20   27   5  0.220     36000       0
1  2012   20  139  559  129  182  30  0.326    492500       4
2  2013   21  157  589  109  190  27  0.323    510000       3
3  2014   22  157  602  115  173  36  0.287   1000000       3
4  2015   23  159  575  104  172  41  0.299   6083000       3
5  2016   24  159  549  123  173  29  0.315  16083000       3
6  2017   25  114  402   92  123  33  0.306  20083000       2
7  2018   26  140  471  101  147  39  0.312  34083000       3
8  2019   27  134  470  110  137  45  0.291  36833333       1


<strong> 4. </strong> We will want to slice and dice the data, so let's see how to access the data by its column header.

In [8]:
print(df.keys())     #the built-in .keys() method returns the column labels
print()
print(df.columns.tolist())   # see the data in a list
print()
print(df['Year'])      #access an individual column using a dictionary syntax  (This is what I prefer)
print()
print(df.AB)           #access a column using the name as an attribute of the dataframe

Index(['Year', 'Age', 'G', 'AB', 'R', 'H', 'HR', 'BA', 'Salary', 'Awards'], dtype='object')

['Year', 'Age', 'G', 'AB', 'R', 'H', 'HR', 'BA', 'Salary', 'Awards']

0    2011
1    2012
2    2013
3    2014
4    2015
5    2016
6    2017
7    2018
8    2019
Name: Year, dtype: int64

0    123
1    559
2    589
3    602
4    575
5    549
6    402
7    471
8    470
Name: AB, dtype: int64


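<strong> 5. </strong> I am assigning some of the columns we will be using to shorter variable names, just for the sake of simplicity. It is easier to refer to these variable names than to type the full column syntax each time. Note that this does not rename anything inside the dataframe itself.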

In [9]:
year = df['Year']
hits = df['H']
at_bats = df['AB']
home_runs = df['HR']
salary = df['Salary']

<strong> 6. </strong> I can now use these variable names just like any other object

In [10]:
print(at_bats)

0    123
1    559
2    589
3    602
4    575
5    549
6    402
7    471
8    470
Name: AB, dtype: int64


<strong> 7. </strong> We can also create new columns. We will start with a blank one. 

In [11]:
df['new_column'] = np.nan

In [12]:
print(df['new_column'])

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
Name: new_column, dtype: float64


<strong> 8. </strong> We can delete columns. I'll delete the nonsense column I just created, but more commonly this is used to clean your datasets of extraneous data

In [13]:
del df['new_column']

In [15]:
print(df['new_column'])      #this will result in an error

KeyError: 'new_column'
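
Besides `del`, pandas also offers the `drop` method. A minimal sketch on a toy dataframe (the `toy` and `cleaned` names are illustrative): `drop` returns a new dataframe and leaves the original untouched, which can be handy when you want to keep the raw data around.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Year": [2011, 2012], "H": [27, 182]})
toy["new_column"] = np.nan                  # same blank column as in step 7

cleaned = toy.drop(columns=["new_column"])  # returns a copy without the column

print(toy.columns.tolist())       # original still has the column
print(cleaned.columns.tolist())   # the copy does not
```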

<strong> 9. </strong> So far, we have only used our columns to access data and neglected rows. Rows are indexed starting at 0. If you think of other python objects, such as lists, the same concept applies. So the first row in the data would be in position 0 (the header row is excluded and treated separately).

Accessing rows in pandas is done using the .loc and .iloc indexers, which take square brackets rather than parentheses. This is slightly more involved than just using the column header to access data. We will start with .iloc, which selects rows by integer position.

In [18]:
print(df.iloc[:5])    #prints first 5 rows of data, notice the index column to the left of the data

   Year  Age    G   AB    R    H  HR     BA   Salary  Awards
0  2011   19   40  123   20   27   5  0.220    36000       0
1  2012   20  139  559  129  182  30  0.326   492500       4
2  2013   21  157  589  109  190  27  0.323   510000       3
3  2014   22  157  602  115  173  36  0.287  1000000       3
4  2015   23  159  575  104  172  41  0.299  6083000       3


<strong> 10. </strong> You can do slicing and similar operations just as you would with a Python list using the .iloc indexer

In [25]:
print(df.iloc[2:3])      #prints only row at index 2  
print()
print(df.iloc[5:])     #prints the row at position 5 and every row after it

   Year  Age    G   AB    R    H  HR     BA  Salary  Awards
2  2013   21  157  589  109  190  27  0.323  510000       3

   Year  Age    G   AB    R    H  HR     BA    Salary  Awards
5  2016   24  159  549  123  173  29  0.315  16083000       3
6  2017   25  114  402   92  123  33  0.306  20083000       2
7  2018   26  140  471  101  147  39  0.312  34083000       3
8  2019   27  134  470  110  137  45  0.291  36833333       1


<strong> 11. </strong> The .loc indexer selects rows by label, but it also accepts a condition. This works somewhat counterintuitively but makes sense once you get the hang of it. Basically, you are accessing rows based on the value located in a column. See the following examples.

In [27]:
young_age = df.loc[df['Age'] < 22]

print(young_age)

   Year  Age    G   AB    R    H  HR     BA  Salary  Awards
0  2011   19   40  123   20   27   5  0.220   36000       0
1  2012   20  139  559  129  182  30  0.326  492500       4
2  2013   21  157  589  109  190  27  0.323  510000       3


<strong> 12. </strong> So you see above, I have effectively located the data for rows in which the Age value is less than 22. Let's do another example.

In [31]:
high_batting_average = df.loc[df['BA'] > .320]

print(high_batting_average)

   Year  Age    G   AB    R    H  HR     BA  Salary  Awards
1  2012   20  139  559  129  182  30  0.326  492500       4
2  2013   21  157  589  109  190  27  0.323  510000       3


<strong> 13. </strong> One more example: let's create a new column and write data to it using a .loc statement. We can actually do this all in one statement, which I'll first show you and then explain.

In [38]:
df.loc[df['BA'] > .320, 'High Batting Average'] = 'Yes'

test = df[['BA', 'High Batting Average']]
print(test)

      BA High Batting Average
0  0.220                  NaN
1  0.326                  Yes
2  0.323                  Yes
3  0.287                  NaN
4  0.299                  NaN
5  0.315                  NaN
6  0.306                  NaN
7  0.312                  NaN
8  0.291                  NaN


<strong> 14. </strong> The above statement is a little more complicated. But as you can see, the first part is what we already did above: I selected rows with a batting average of > .320. The second argument names the column to write to ('High Batting Average'). Because that column does not exist yet, pandas creates it. The value 'Yes' is then assigned to the selected rows only. There is no explicit row-by-row loop here: pandas evaluates the comparison for the whole column at once, producing a True or False value for each row, and writes 'Yes' wherever the result is True. Rows that do not match are left as NaN, as you can see in the output.
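
To make the mechanics concrete, this sketch prints the True/False mask that the comparison produces, then uses `np.where` as one possible way to fill in both 'Yes' and 'No' in a single step. The `ba` dataframe holds the first four batting averages from the table above; `np.where` is an alternative I am introducing here, not what the cell above uses.

```python
import numpy as np
import pandas as pd

ba = pd.DataFrame({"BA": [0.220, 0.326, 0.323, 0.287]})

mask = ba["BA"] > .320        # one True/False value per row, computed in one go
print(mask)

# np.where picks 'Yes' where the mask is True and 'No' everywhere else
ba["High Batting Average"] = np.where(mask, "Yes", "No")
print(ba)
```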


Pandas selection statements can get very tedious and there are endless variations and much more functionality than I have demonstrated. But for now, let's move on to visualizing the data.

## Matplotlib Plots ##

Now we will begin plotting in [matplotlib](https://matplotlib.org/). Because we have our data stored in a pandas dataframe, we can now analyze it however we like. I'll be working with some of the basic plot types. This will barely scratch the surface. I will also be adding some customization and formatting to show you that you can basically customize your plots to look however you like.


## Bar Plots

This is a very simple plot of Mike Trout's hits per year. Let's start with the basics.

In [None]:
plt.bar(year, hits)
plt.show()

So as you see, I've got # of Hits on the Y Axis, and Year on the X Axis. But what are the year and hits objects?

In [None]:
print(type(year))
print(type(hits))

As you see, these are pandas Series objects. Again, a series is a one-dimensional labeled array of data. I'll be pulling different series out of my pandas dataframe so I can plot them using matplotlib.

Our first plot was as basic as it gets. Let's add some labels to make it look a little better.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.suptitle('Mike Trout Hits per year')
plt.bar(year, hits)
plt.show()
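
Matplotlib also has an object-oriented interface, where you create a figure and an axes object and call methods on the axes. This sketch rebuilds the labelled bar chart that way. The `demo_year` and `demo_hits` arrays are typed in from the table earlier in the notebook so the block runs on its own. In a notebook the figure is displayed when the cell finishes; in a script you would add `plt.show()` at the end.

```python
import matplotlib.pyplot as plt
import numpy as np

demo_year = np.arange(2011, 2020)
demo_hits = np.array([27, 182, 190, 173, 172, 173, 123, 147, 137])

fig, ax = plt.subplots()          # one figure containing one set of axes
bars = ax.bar(demo_year, demo_hits)
ax.set_xlabel('Year')
ax.set_ylabel('# of Hits')
ax.set_title('Mike Trout Hits per year')
```

Holding on to `ax` makes it clear which plot each label belongs to, which matters once a figure contains more than one chart.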

## Horizontal Bar Plots

Let's turn our bar plot sideways

In [None]:

plt.xlabel('# of Hits')
plt.ylabel('Year')
plt.suptitle('Mike Trout Hits per year')
plt.barh(year, hits, color='red')       #notice I changed the color argument. Blue is the default color
plt.show()

## Line Plot

We can also do simple line plots. Here is hits per year as a line plot.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.grid()
plt.plot(year, hits)
plt.show()

## Combined plots

You can also put them together. 
In this plot, I have the # of hits plotted in blue as a bar chart, and number of At Bats in red as a line graph. 

But notice, our old labels don't work anymore! The y-axis label still says '# of Hits', yet the axis now shows at-bats as well.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.plot(year, at_bats, color='red')
plt.bar(year, hits)
plt.show()

## Legends

A legend is probably the right thing to bring more clarity to our plot

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - At Bats and Hits per Year')
plt.plot(year, at_bats, color='red', label='At Bats')
plt.bar(year, hits, label='Hits')
plt.legend()         #makes the legend happen!
plt.show()

## Stacked Bar Chart

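We can stack bar charts on top of each other.

In this chart, I am drawing the home run bars over the hit bars. Both sets of bars start at zero, so strictly speaking they are overlaid rather than stacked. Because every home run is also a hit, you still get a visual picture of the ratio of home runs to overall hits.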

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - Home Runs vs Total Hits')


plt.bar(year, hits, label='Hits')
plt.bar(year, home_runs, label='Home Runs')

plt.legend()
plt.show()
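
For comparison, a bar chart can also be stacked explicitly with the `bottom` argument of `bar`, so that the second set of bars starts where the first one ends. In this sketch I split hits into home runs and all other hits (the `other_hits` name is mine), so the full height of each stack equals total hits. Values are typed in from the table earlier in the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

demo_year = np.arange(2011, 2020)
demo_hits = np.array([27, 182, 190, 173, 172, 173, 123, 147, 137])
demo_home_runs = np.array([5, 30, 27, 36, 41, 29, 33, 39, 45])
other_hits = demo_hits - demo_home_runs     # hits that were not home runs

fig, ax = plt.subplots()
lower = ax.bar(demo_year, demo_home_runs, label='Home Runs')
# bottom= lifts each bar so it begins at the top of the home run bar
upper = ax.bar(demo_year, other_hits, bottom=demo_home_runs, label='Other Hits')
ax.set_xlabel('Year')
ax.set_title('Mike Trout - Home Runs within Total Hits')
ax.legend()
```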

## Grouped Bar Chart

In order to have my bar charts side by side, I need to move one of them to the side, and also make the bars skinnier so that everything fits

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - Home Runs vs Total Hits')

plt.xticks(rotation=45)         #rotates labels by 45 degrees
plt.xticks(year)                #shows all years in label

plt.bar(year, hits, width=.2, label='Hits')
plt.bar(year+.2, home_runs, width=.2, label='Home Runs')        #shifted right by one bar width so the two sets sit side by side
plt.legend()
plt.show()
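
A variation on the cell above: shifting one set of bars left and the other right by half a bar width keeps each pair centered on its year tick. The `width` variable and the `demo_` arrays are names chosen for this sketch, with values typed in from the table earlier in the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

demo_year = np.arange(2011, 2020)
demo_hits = np.array([27, 182, 190, 173, 172, 173, 123, 147, 137])
demo_home_runs = np.array([5, 30, 27, 36, 41, 29, 33, 39, 45])
width = 0.4

fig, ax = plt.subplots()
left = ax.bar(demo_year - width / 2, demo_hits, width=width, label='Hits')
right = ax.bar(demo_year + width / 2, demo_home_runs, width=width, label='Home Runs')
ax.set_xticks(demo_year)                       # one tick per year, between the two bars
ax.tick_params(axis='x', labelrotation=45)     # rotate the year labels by 45 degrees
ax.legend()
```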

## Labels

I can add labels on my figures to show exact values

In [None]:
plt.xlabel('Year')
plt.xticks(rotation=45)
plt.xticks(year)                #shows all years in label

plt.ylabel('# of Hits')           
plt.suptitle('Mike Trout Hits per year')

for bar in plt.bar(year, hits):
    plt.text(bar.get_x() + bar.get_width() / 2,   #x position of label: horizontal center of the bar
             bar.get_height() - 20,           #y position of label: 20 units below the top of the bar
             bar.get_height(),              #actual value of label
             ha='center',
             va='bottom')
plt.show()
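
Recent Matplotlib releases (3.4 and later, to my knowledge) include a `bar_label` helper that places the value labels for you. A minimal sketch, assuming such a version is installed; the `demo_` arrays repeat the values from the table earlier in the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

demo_year = np.arange(2011, 2020)
demo_hits = np.array([27, 182, 190, 173, 172, 173, 123, 147, 137])

fig, ax = plt.subplots()
bars = ax.bar(demo_year, demo_hits)
labels = ax.bar_label(bars, padding=2)    # one text label just above each bar
ax.set_xlabel('Year')
ax.set_ylabel('# of Hits')
```

The manual loop is still useful when you need full control over where each label goes, for example to put it inside the bar.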


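Remember, you can do math on the fly with your dataframe objects! Dividing one series by another works row by row, so here each year's salary is divided by that year's home runs.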

In [None]:
cost_per_home_run = salary/home_runs

print(type(cost_per_home_run))
print(cost_per_home_run)
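
As a quick check that arithmetic between two series really is applied row by row, this sketch recomputes batting average as hits divided by at-bats and compares it with the BA column. The values are typed in from the table earlier in the notebook, and the `stats` name is only for illustration.

```python
import pandas as pd

stats = pd.DataFrame({
    "H":  [27, 182, 190],
    "AB": [123, 559, 589],
    "BA": [0.220, 0.326, 0.323],
})

computed_ba = (stats["H"] / stats["AB"]).round(3)   # element-wise division
print(computed_ba)
print((computed_ba - stats["BA"]).abs().max())      # largest difference from the BA column
```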

In [None]:
fig, ax = plt.subplots()

plt.xlabel('Year')
plt.xticks(rotation=45)
plt.xticks(year)

formatter = ticker.FormatStrFormatter('$%.0f')     #formatting y axis as dollar amounts
ax.yaxis.set_major_formatter(formatter)

plt.ylabel('Price')           
plt.suptitle('Mike Trout Yearly Cost Per Home Run')
plt.bar(year, cost_per_home_run)
plt.show()
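
If you want thousands separators in the dollar labels, `ticker.StrMethodFormatter` accepts a format string in Python's `str.format` style. A small sketch using the first three seasons only; the `demo_cost` values are salary divided by home runs, computed from the table earlier in the notebook.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

demo_seasons = [2011, 2012, 2013]
demo_cost = [36000 / 5, 492500 / 30, 510000 / 27]   # salary / home runs

fig, ax = plt.subplots()
ax.bar(demo_seasons, demo_cost)
ax.set_xticks(demo_seasons)

# '{x:,.0f}' formats the tick value x with commas and no decimal places
dollar_format = ticker.StrMethodFormatter('${x:,.0f}')
ax.yaxis.set_major_formatter(dollar_format)

print(dollar_format(7200, 0))     # format a single value the way the axis will
```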

## Scatter Plot

Now I'll give you some other examples of random plots, just to give you more ideas of what is possible

In [None]:
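N = 50                     #number of points
x = np.random.rand(N)      #N random x values between 0 and 1
y = np.random.rand(N)      #N random y values between 0 and 1
print(x)
area = np.pi*3             #marker size in points squared, the same for every point
print(area)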

In [None]:
plt.scatter(x, y, s=area, alpha=0.5)      #alpha=0.5 makes the markers half transparent
plt.title('Scatter plot pythonspot')
plt.show()
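
Because `np.random.rand` gives different numbers on every run, the plot changes each time. If you want a reproducible figure, one option is NumPy's seeded generator, sketched below. The seed value 42 and the `sizes` array (a different marker area per point) are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)   # seeded generator: same numbers on every run
n_points = 50
xs = rng.random(n_points)
ys = rng.random(n_points)
sizes = np.pi * (1 + 9 * rng.random(n_points)) ** 2   # marker areas in points squared

fig, ax = plt.subplots()
points = ax.scatter(xs, ys, s=sizes, alpha=0.5)
ax.set_title('Scatter plot with varying marker sizes')
```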