## Data Analysis and Visualization in Pandas and Matplotlib ##

This Jupyter Notebook introduces data visualization in Python. We will use the Pandas and Matplotlib libraries to load our data and visualize it.

First, some background on the libraries and technologies we will be using. We are all using the Anaconda software distribution today, which comes with a lot of functionality installed on top of the base python libraries. This includes the pandas and matplotlib packages as well as the JupyterLab/Jupyter Notebook environment.

## JupyterLab/Jupyter Notebooks ##

[Project Jupyter Homepage](https://jupyter.org/)

The Jupyter environment is a web-based interactive computational environment for creating notebook-like documents. It supports several languages like python, R, Julia, etc. JupyterLab is the next generation user interface, which includes Jupyter Notebooks.

In my opinion, they seem almost exactly the same but I'm sure people embedded within the development of the project would tell you differently.

Think of the Jupyter environment as an interactive blog post. As you'll see, Jupyter allows you to show your code and explain it in a very neat, easy to follow way. Each cell either contains text (like this one) or code. When writing code, each cell basically functions like the command line or console. And as you'll see, each cell is LIVE and you can change your code on the fly. 

Jupyter really excels in situations like this class where we will be walking through a topic step by step. I can explain things, you can play with the code, and it is easy for everyone to see. 

<strong>Step 1.</strong> Let's get started by importing the libraries we will be using. I will explain these later as we go. Because we are all using Anaconda, all of these libraries should already be installed; the install cell below is only a safety net in case one is missing.

In [None]:
!conda install --yes numpy pandas matplotlib

In [None]:
import pandas as pd
import matplotlib.pyplot as plt     #pyplot is matplotlib's state-based plotting interface (the plt.* functions)
import matplotlib.ticker as ticker
import numpy as np


<strong> Step 2. </strong> First we need to read in some data so we can work with it. This is a CSV sheet of career stats for professional baseball player Mike Trout. Baseball is a numbers game, so this gives us a nice, easy-to-use dataset to work with.

In [None]:
df = pd.read_csv("MikeTroutData.csv")
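
If you want to see what `pd.read_csv()` does without the file on hand, here is a minimal sketch that reads a tiny CSV from a string instead. The three rows are made up for illustration and are not Mike Trout's real stats; `io.StringIO` just lets pandas treat a string as if it were a file, and `sample_df` is a throwaway name so we do not overwrite `df`.

```python
import io
import pandas as pd

# Made-up sample rows (NOT real stats), shaped like the columns used in this notebook
csv_text = """Year,Age,AB,H,HR,BA
2001,20,100,25,3,0.250
2002,21,400,120,20,0.300
2003,22,500,165,30,0.330
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # the data type pandas inferred for each column
```

Checking `.shape` and `.dtypes` right after loading is a quick way to confirm the file was read the way you expected.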

## Pandas ##

[Pandas Documentation](https://pandas.pydata.org/)

Pandas is an open source python library providing high-performance, easy-to-use data structures and data analysis tools. We will be using pandas to work with our data before feeding it into Matplotlib. Pandas can read from and write to many different data formats. It is intelligent in handling missing or bad data. You can easily reshape or pivot your data. It is optimized for performance. And it has a massive international user community so help and examples are readily available. 

## Pandas Dataframes ##

The aforementioned easy-to-use data structure in pandas is called a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). This is a tabular or spreadsheet-like view of your data, just as you'd see it in Excel. A pandas dataframe is a 2-dimensional labeled data structure with columns and rows. It is the most commonly used pandas object. Each one-dimensional row or column is called a [pandas series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). When you build a dataframe yourself, you can pass index (row labels) or columns as arguments along with the data.
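
To make the index and columns arguments concrete, here is a small sketch that builds a dataframe by hand. The numbers and the row labels (`row_a`, `row_b`, `row_c`) are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame(
    [[10, 2], [12, 4], [15, 5]],
    index=["row_a", "row_b", "row_c"],   # row labels
    columns=["H", "HR"],                 # column labels
)

print(demo)
print(type(demo["H"]))           # a single column is a pandas Series
print(type(demo.loc["row_a"]))   # a single row is also a pandas Series
```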

<strong> Step 3. </strong> Let's take a look at our data

In [None]:
print(df)

<strong> Step 4. </strong> We will want to slice and dice the data, so let's see how to access the data by its column header.

In [None]:
print(df.keys())     #the built-in .keys() method returns the column labels
print()
print(df.columns.tolist())   # see the data in a list
print()
print(df['Year'])      #access an individual column using a dictionary syntax  (This is what I prefer)
print()
print(df.AB)           #access a column using the name as an attribute of the dataframe
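
One reason I prefer the dictionary syntax: it works for any column name. Here is a quick sketch using a made-up dataframe with a hypothetical column called 'Batting Avg'. The space in the name means it cannot be written as `cols_demo.Batting Avg`.

```python
import pandas as pd

# Made-up frame with one awkward column name (hypothetical, for illustration)
cols_demo = pd.DataFrame({"AB": [100, 400], "Batting Avg": [0.250, 0.300]})

print(cols_demo.AB)                       # attribute access works for simple names
print(cols_demo["Batting Avg"])           # names with spaces need the bracket syntax
print(cols_demo[["AB", "Batting Avg"]])   # a list of names returns a dataframe, not a series
```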

<strong> Step 5. </strong> I am assigning some of the columns we will be using to their own variables, just for the sake of simplicity. It is easier to refer to these variable names than to write out the full column syntax each time.

In [None]:
year = df['Year']
hits = df['H']
at_bats = df['AB']
home_runs = df['HR']


*** YOUR TURN *** 
Choose another column and make your own variable with it. Then print the output.

In [None]:
## Enter your code here




<strong> Step 6. </strong> I can now use these variable names just like any other object

In [None]:
print(at_bats)

<strong> Step 7. </strong> We can also create new columns. We will start with a blank one. 

In [None]:
df['new_column'] = np.nan

In [None]:
print(df['new_column'])

<strong> Step 8. </strong> We can delete columns. I'll delete the nonsense column I just created, but more commonly this is used to clean your datasets of extraneous data.

In [None]:
del df['new_column']

In [None]:
print(df['new_column'])      #this will result in an error
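
As an alternative to `del`, pandas also has a `.drop()` method. Here is a minimal sketch on a made-up dataframe (the `junk` column is hypothetical). Unlike `del`, it hands back a new dataframe and leaves the original alone.

```python
import pandas as pd

# Made-up values for illustration only
drop_demo = pd.DataFrame({"H": [25, 120], "HR": [3, 20], "junk": [None, None]})

# drop() returns a new dataframe and leaves the original untouched
cleaned = drop_demo.drop(columns=["junk"])

print(cleaned.columns.tolist())
print("junk" in drop_demo.columns)   # the original still has the column
```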

<strong> Step 9. </strong> So far, we have only used our columns to access data and neglected rows. Rows are indexed starting at 0. If you think of other python objects, such as lists, the same concept applies. So the first row in the data would be in position 0 (the header row is excluded and treated separately).

Accessing rows in pandas is done using the .loc[] and .iloc[] indexers (note the square brackets, not parentheses) and is slightly more involved than just using the column header to access data. We will start with .iloc[], which selects rows by position.

In [None]:
print(df.iloc[:5])    #prints first 5 rows of data, notice the index row to the left of the data

In [None]:
#You can also use the .head() method

print(df.head(10))

<strong> Step 10. </strong> You can do slicing and similar operations just as you would with a python list using the .iloc[] indexer.

In [None]:
print(df.iloc[2:3])      #prints only row at index 2  
print()
print(df.iloc[5:])     #prints everything row 5 and up
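
Here is a small sketch of the difference between selecting by position (`.iloc[]`) and selecting by label (`.loc[]`). It uses a made-up dataframe whose index is a set of years rather than the default 0, 1, 2, so the two are easy to tell apart.

```python
import pandas as pd

# Made-up values; the index uses years instead of the default 0, 1, 2...
idx_demo = pd.DataFrame({"H": [25, 120, 165]}, index=[2001, 2002, 2003])

print(idx_demo.iloc[0])          # first row, selected by position
print(idx_demo.loc[2001])        # the same row, selected by its index label
print(idx_demo.iloc[-1])         # negative positions count from the end
print(idx_demo.iloc[0:2])        # position slices exclude the end (2 rows)
print(idx_demo.loc[2001:2002])   # label slices include the end (also 2 rows)
```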

<strong> Step 11. </strong> The .loc[] indexer works somewhat counterintuitively but makes sense once you get the hang of it. It selects rows by label or, as in the examples here, by a condition on the values in a column. See the following examples.

In [None]:
young_age = df.loc[df['Age'] < 22]

print(young_age)

<strong> Step 12. </strong> So you see above, I have effectively located the rows in which the value in the 'Age' column is less than 22. Let's do another example.

In [None]:
high_batting_average = df.loc[df['BA'] > .320]

print(high_batting_average)
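
It can help to look at the condition on its own. This sketch, on made-up numbers, shows that a comparison like the one above produces a series of True/False values, and that two conditions can be combined with `&`.

```python
import pandas as pd

# Made-up values for illustration only
mask_demo = pd.DataFrame({"Age": [20, 21, 22, 23], "BA": [0.250, 0.330, 0.300, 0.325]})

mask = mask_demo["BA"] > .320      # a series of True/False, one value per row
print(mask)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_and_hot = mask_demo.loc[(mask_demo["Age"] < 22) & (mask_demo["BA"] > .320)]
print(young_and_hot)
```

Passing that True/False series to `.loc[]` keeps only the rows marked True.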

<strong> Step 13. </strong> One more example: let's create a new column and write data to it using a .loc[] statement. We can actually do this all in one statement, which I'll first show you and then explain.

In [None]:
df.loc[df['BA'] > .320, 'High Batting Average'] = 'Yes'

example = df[['BA', 'High Batting Average']]
print(example)

<strong> Step 14. </strong> The above statement is a little more complicated. But as you can see, the first part is what we already did above: I selected rows with a batting average of > .320. The second argument names the new column ('High Batting Average'), and the assignment writes the value 'Yes' to that column for every row where the condition is true. Pandas evaluates the condition for the whole column at once rather than looping through rows one at a time, and rows where the condition is false are left as NaN (missing).
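
Here is a small sketch, on made-up batting averages, showing what ends up in the rows that did not match and one possible way to fill in both outcomes at once. Using `np.where` here is my own suggestion, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
flag_demo = pd.DataFrame({"BA": [0.250, 0.330, 0.300]})

flag_demo.loc[flag_demo["BA"] > .320, "High Batting Average"] = "Yes"
n_missing = flag_demo["High Batting Average"].isna().sum()
print(flag_demo)       # rows that did not match hold NaN in the new column
print(n_missing)       # how many rows were left empty

# One way to write 'Yes' or 'No' to every row in a single step
flag_demo["High Batting Average"] = np.where(flag_demo["BA"] > .320, "Yes", "No")
print(flag_demo)
```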


Pandas selection statements can get very tedious and there are endless variations and much more functionality than I have demonstrated. But for now, let's move on to visualizing the data.

*** YOUR TURN *** Make your own column and fill it with some data, as we did above with the 'High Batting Average' column. This new column can contain nonsense data; this is just an example.

In [None]:
## Enter your code here




## Data Visualization ##

Finally, we get to the point where we can see our data in other ways than just a tabular format! Luckily for us, there are many data visualization libraries available in python. We will learn about just a few of the major ones, in particular Matplotlib. 

## Matplotlib ##

[Matplotlib documentation](https://matplotlib.org/)

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. It integrates very nicely with Pandas, NumPy, and other related libraries.

From [MatPlotLib's Wikipedia page](https://en.wikipedia.org/wiki/Matplotlib): Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications.


## Bar Plots

<strong> Step 15. </strong> Let's get to it! This is a very simple plot of Mike Trout's hits per year. As you can see, I am using objects I already defined previously. Specifically these objects are columns from the Mike Trout stats dataframe (df). Previously I had defined the specific columns as their own variable name (year and hits). 

In [None]:
plt.bar(year, hits)
plt.show()

<strong> Step 16. </strong> So as you see, I've got # of Hits on the Y Axis, and Year on the X Axis. Matplotlib provides many functions to produce different charts and plots, such as plt.bar() shown above. The plt.show() function displays the plot. Jupyter will often render the figure at the end of a cell even without it, but it is required when running a plain python script, so it is a good habit. But what are the year and hits objects?

In [None]:
print(type(year))
print(type(hits))

<strong> Step 17. </strong> As you see, these are pandas Series objects. Again, a series is a 1-Dimensional array of data. I'll be pulling different Series objects out of my pandas dataframe so I can plot them using matplotlib. You could also call these columns explicitly.


In [None]:
plt.bar(df['Year'], df['H'])
plt.show()

*** YOUR TURN *** Make a bar plot showing the 'Year' on the x axis and another variable (column) on the y axis.

In [None]:
## Enter your code here




<strong> Step 18. </strong> Our first plot was as basic as it gets. Let's add some labels to make it look a little better.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.suptitle('Mike Trout Hits per year')
plt.bar(year, hits)
plt.show()

## Horizontal Bar Plots

<strong> Step 19. </strong> Let's turn our bar plot sideways. We do this using the plt.barh() function.

In [None]:

plt.xlabel('# of Hits')
plt.ylabel('Year')
plt.suptitle('Mike Trout Hits per year')
plt.barh(year, hits, color='red')       #notice I changed the color argument. Blue is the default color
plt.show()

## Line Plot

<strong> Step 20. </strong> We can also do simple line plots. Here is hits per year as a line plot. 

**Disclaimer** I do not recommend displaying this data as a line plot. This is bad data visualization! I will improve this later.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.grid()                    #I added a background grid
plt.plot(year, hits)
plt.show()

## Combined plots

<strong> Step 21. </strong> You can also put them together. 
In this plot, I have the # of hits plotted in blue as a bar chart, and number of At Bats in red as a line graph. 

But notice, our old labels don't work anymore! The y axis now shows both hits and at bats, so '# of Hits' is misleading.

In [None]:
plt.xlabel('Year')
plt.ylabel('# of Hits')
plt.bar(year, hits)
plt.plot(year, at_bats, color='red')
plt.show()

## Legends

<strong> Step 22. </strong> A legend is probably the right thing to bring more clarity to our plot. This is a simple process. Add a label argument to each plot function, and the legend will read these labels. Lastly, the plt.legend() function is needed to show the legend on the plot.

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - At Bats and Hits per Year')
plt.plot(year, at_bats, color='red', label='At Bats')
plt.bar(year, hits, label='Hits')
plt.legend()         #makes the legend happen!
plt.show()

*** Your Turn *** Make a plot showing the year on the x axis, a line showing games played, and a bar showing home runs. Change colors if you like and also create a legend with these items

In [None]:
## Enter your code here




## Stacked Bar Chart

<strong> Step 23. </strong> We can stack bar charts on top of each other.

In this chart, I am drawing the home run bars in front of the hits bars, with both starting at zero. Because home runs are a subset of hits, this gives you a visual picture of the ratio of home runs to overall hits.

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - Home Runs vs Total Hits')

plt.bar(year, hits, label='Hits')
plt.bar(year, home_runs, label='Home Runs')

plt.legend()
plt.show()
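
The cell above draws both sets of bars from zero, which works because home runs are already counted in hits. When the categories do not overlap, a stacked chart usually means each bar starts where the previous one ended. Here is a minimal sketch of that using the `bottom` argument of `bar()`. The numbers are made up for illustration and are not real stats.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not real stats)
demo_years = [2001, 2002, 2003]
demo_home_runs = [3, 20, 30]
demo_other_hits = [22, 100, 135]      # hits that were not home runs

fig, ax = plt.subplots()
ax.bar(demo_years, demo_home_runs, label='Home Runs')
# bottom= starts each bar where the previous one ended, so the heights add up
ax.bar(demo_years, demo_other_hits, bottom=demo_home_runs, label='Other Hits')
ax.set_xticks(demo_years)
ax.set_xlabel('Year')
ax.legend()
plt.show()
```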

## Grouped Bar Chart

<strong> Step 24. </strong> In order to have my bar charts side by side, I need to move one of them to the side, and also make the bars skinnier so that everything fits.

In [None]:
plt.xlabel('Year')
plt.suptitle('Mike Trout - Home Runs vs Total Hits')

plt.xticks(rotation=45)         #rotates labels by 45 degrees
plt.xticks(year)                #shows all years in label

plt.bar(year, hits, width=.2, label='Hits')
plt.bar(year+.2, home_runs, width=.2, label='Home Runs')        #moved the bars around manually
plt.legend()
plt.show()
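
If you would rather have each pair of bars centered on its year, one option is to shift both bars by half a bar width in opposite directions. This is a sketch with made-up numbers; the width of 0.4 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration only (not real stats)
demo_years = np.array([2001, 2002, 2003])
demo_hits = [25, 120, 165]
demo_home_runs = [3, 20, 30]
width = 0.4                           # arbitrary bar width for this sketch

fig, ax = plt.subplots()
ax.bar(demo_years - width / 2, demo_hits, width=width, label='Hits')
ax.bar(demo_years + width / 2, demo_home_runs, width=width, label='Home Runs')
ax.set_xticks(demo_years)             # one tick per year, centered under each pair
ax.set_xlabel('Year')
ax.legend()
plt.show()
```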

## Labels

<strong> Step 25. </strong> I can add labels on my figures to show exact values. This is more complicated; as you can see, I have included a loop. I had to google for examples of this and apply them to my own needs. This shows you that because there is so much functionality available in Matplotlib, you can customize your plot to look any way you want. But it can get complicated. Just remember, there is a huge user community on sites such as StackOverflow, personal blogs, etc. for you to tap into.

The loop below iterates through each bar and builds a label for it: the text (the bar's height) and the position where that text is drawn.

In [None]:
plt.xlabel('Year')
plt.xticks(rotation=45)
plt.xticks(year)                #shows all years in label

plt.ylabel('# of Hits')           
plt.suptitle('Mike Trout Hits per year')

for bar in plt.bar(year, hits):
    plt.text(bar.get_x() + bar.get_width() / 2,   #x position of label (center of the bar)
             bar.get_height() - 20,               #y position of label (just below the top of the bar)
             bar.get_height(),                    #actual value of label
             ha='center',
             va='bottom')
plt.show()
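
If your Matplotlib is version 3.4 or newer, there is also a built-in helper, `bar_label()`, that can replace the loop. Here is a minimal sketch with made-up numbers. It uses the `fig, ax = plt.subplots()` style, and the padding of 3 points is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not real stats)
demo_years = [2001, 2002, 2003]
demo_hits = [25, 120, 165]

fig, ax = plt.subplots()
bars = ax.bar(demo_years, demo_hits)
value_labels = ax.bar_label(bars, padding=3)   # writes each bar's height just above it
ax.set_xticks(demo_years)
ax.set_xlabel('Year')
ax.set_ylabel('# of Hits')
plt.show()
```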


<strong> Step 26. </strong> Remember, you can do math on the fly with your dataframe objects! Let's create a new column on the fly and use it for our next examples. This is the amount of money Mike Trout is paid per home run.

In [None]:
salary = df['Salary']
cost_per_home_run = salary/home_runs

print(type(cost_per_home_run))
print(cost_per_home_run)
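
One thing to watch for with math like this is dividing by zero. In this sketch, with made-up salaries and a made-up season with zero home runs, pandas returns `inf` instead of raising an error. Replacing the zero with NaN first is one possible way to handle it.

```python
import pandas as pd

# Made-up values; the second season has zero home runs
demo_salary = pd.Series([500_000, 500_000, 1_000_000])
demo_hr = pd.Series([5, 0, 20])

demo_cost = demo_salary / demo_hr
print(demo_cost)      # dividing by zero gives inf rather than raising an error

# Treat a zero as "no value" so that season shows up as NaN instead
demo_cost_clean = demo_salary / demo_hr.replace(0, float("nan"))
print(demo_cost_clean)
```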

<strong> Step 27. </strong> In the following cell, I formatted the y axis labels using the Matplotlib ticker module (imported in our first cell with the other import statements). String formatting is not something I do often, and I knew I wanted to represent dollar amounts here, so again I googled for an example of how to do it.

In [None]:
fig, ax = plt.subplots()

plt.xlabel('Year')
plt.xticks(rotation=45)
plt.xticks(year)

formatter = ticker.FormatStrFormatter('$%.0f')     #formatting y axis as dollar amounts
ax.yaxis.set_major_formatter(formatter)

plt.ylabel('Price')           
plt.suptitle('Mike Trout Pay Per Home Run')
plt.bar(year, cost_per_home_run)
plt.show()
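
For large dollar amounts, thousands separators make the axis easier to read. Here is a small sketch using `StrMethodFormatter`, another formatter in the ticker module. The name `dollar_formatter` and the sample number are just for illustration.

```python
import matplotlib.ticker as ticker

# StrMethodFormatter uses str.format syntax; {x} is the tick value
dollar_formatter = ticker.StrMethodFormatter('${x:,.0f}')

# A formatter can be called directly to preview what a label would look like
print(dollar_formatter(1234567))

# To use it on a plot: ax.yaxis.set_major_formatter(dollar_formatter)
```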

## Scatter Plot

<strong> Step 28. </strong> Now I'll give you some other examples of random plots, just to give you more ideas of what is possible. This next cell generates 50 random numbers to use in a scatter plot.

In [None]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
print(x)
area = np.pi*3
print(area)

In [None]:
plt.scatter(x, y, s=area, alpha=0.5)
plt.title('Scatter plot pythonspot')
plt.show()
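
Random data changes every time you run the cell. If you want the same picture each run, you can seed the generator. This sketch uses NumPy's `default_rng` with an arbitrary seed of 42, and also gives every point its own marker size and color to show what else `scatter()` accepts.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)    # fixed seed so the same points appear every run
demo_x = rng.random(50)
demo_y = rng.random(50)
demo_sizes = 200 * rng.random(50)       # one marker area per point

fig, ax = plt.subplots()
ax.scatter(demo_x, demo_y, s=demo_sizes, c=demo_y, alpha=0.5)   # color follows the y value
ax.set_title('Scatter plot with varying marker sizes')
plt.show()
```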

## Other Plotting Packages ##

Matplotlib is not your only option! You may want different functionality or more advanced graphics, you may want to use what you already know from other languages, or you may just be curious about what else is available. Here is a brief overview of some other packages.

## ggplot for python ##

[ggplot homepage](http://ggplot.yhathq.com/)

ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. The ggplot python library evolved out of the ggplot2 R-specific package. It seems to be accepted that ggplot2 (in R) is a more sophisticated graphics tool and provides more high end functionality. It is not clear to me if ggplot for python integrates all the functionality that ggplot2 has in R.


## Seaborn ##

[Seaborn homepage](https://seaborn.pydata.org/)

Seaborn is a python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. It seems to be accepted as an extension to matplotlib functionality, particularly for statistical visualization.

## Bokeh ##

[Bokeh homepage](https://docs.bokeh.org/en/latest/index.html)

Bokeh is different in that it does not depend on matplotlib and is geared toward generating visualizations in the web browser. It is meant to make interactive web visualizations.

## Which one should I use? ##

There is no right or wrong answer. It depends what you are doing, what you are familiar with, or other influences in your life. Matplotlib is a good jack of all trades package for relatively basic plotting and graphing. It also integrates nicely with numpy and pandas, two other very common scientific packages.

All these packages have large user communities and good documentation. My advice is to choose one you like and stick with it unless you find it does not have the functionality you are looking for.

Reasons to use any given data visualization package/tool in python:

- You are already familiar with it
- Your advisor/professor already likes one and you live with that decision
- You inherited code that is already using that package
- You found a code example you liked online for a specific package

## Self Help - You don't need to remember all of this! ##

Here are a few resources I use when looking for code examples, solutions, etc.

ChatGPT
* ChatGPT has quickly made huge changes to the programming landscape. It is a hugely powerful tool **if you use it the right way!** Advising new programmers on how to use ChatGPT (or other AI tools) is tricky, so I will refer to some best practices. My personal opinion is that you should use AI minimally when you are starting. When you have a better grasp of the fundamentals, you can bring in AI and greatly increase your speed. **Never accept ChatGPT code verbatim!** Always double check it before including it in your workflows.
* [How to Effectively Learn to Program w/ ChatGPT](https://towardsdatascience.com/how-to-effectively-start-coding-in-the-era-of-chatgpt-cfc5151e1c42)
* [Corey Schafer's "How to use ChatGPT"](https://www.youtube.com/watch?v=jRAAaDll34Q)

Google 
* Ex: "How to make dictionary python" 
* Ex: "python decorators"

[Stack Overflow](https://stackoverflow.com/) 
* A question/answer site for programming questions (actually, not just programming any more) 
* Not only python 
* DO NOT just ask questions, do your research first! 
* Odds are very high someone has already asked your question, especially as a novice

[Youtube - Corey Schafer](https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g) 
* If you have a question about a python programming concept, Corey Schafer has covered it

[Practice Python](http://www.practicepython.org/) 
* Coding challenges for programmers of all levels

[Python Tutor](http://pythontutor.com/) 
* Visualize what your code is doing step-by-step 
* Has limitations once you start importing libraries
