# More Advanced Pandas & Intro to Matplotlib

In [1]:
#load libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Reviewing .iloc/.loc/.ix 

*.iloc* = iloc selects by integer position (EXCLUDES the stop value of a slice)<br> </br>
*.loc* = loc selects by index label (INCLUDES the stop value of a slice)<br> </br>
*.ix* = ix worked on mixed labels and positions (included the stop value); it was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead<br> </br>
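
A minimal sketch of the slicing difference, assuming a small hypothetical frame with string row labels (the frame and its `score` column are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with string row labels, used only for illustration
df = pd.DataFrame({'score': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_position = df.iloc[0:2]   # positions 0 and 1; stop position 2 is excluded
by_label = df.loc['a':'c']   # labels 'a' through 'c'; stop label 'c' is included

print(by_position)
print(by_label)
```

The same-looking slice therefore returns two rows with `.iloc` and three rows with `.loc`.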

check: https://www.youtube.com/watch?v=xvpNA7bC8cs&t=1166s

# Working with missing data

Pandas uses `NaN` (not a number) as its default marker for missing values. Missing data are values that are absent from a dataset. Source files often code missing values in their own way (for example as a string or a sentinel number), so we must tell Pandas explicitly which values to treat as missing.

In [2]:
df2 = pd.DataFrame({'a':[1,2,3,4,5],'b':[10,'NaN',30,40,50],'c':[100,200,'NaN','NaN',500],'d':['NaN','NaN','NaN',4000,5000]})
df2


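|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 10.0 | 100.0 | NaN |
| 1 | 2 | NaN | 200.0 | NaN |
| 2 | 3 | 30.0 | NaN | NaN |
| 3 | 4 | 40.0 | NaN | 4000.0 |
| 4 | 5 | 50.0 | 500.0 | 5000.0 |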


In [None]:
# Count the number of missing values in Each row DEMO


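Missing values must be specified explicitly: the string 'NaN' in the cell above is ordinary text, not a true missing value, so we must always check how pandas has interpreted the missing values.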

*Review how we loaded the files from last week and how we explicitly told pandas where the missing values are*
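
As a reminder of that idea, here is a hedged sketch using the `na_values` argument of `pd.read_csv`. The file contents and the codes `'missing'` and `'-999'` are assumptions made up for this example, not last week's actual data:

```python
import io
import pandas as pd

# Hypothetical file contents: missing values are coded as 'missing' and -999
csv_text = "id,height,weight\n1,170,65\n2,missing,70\n3,180,-999\n"

# na_values lists the extra codes that pandas should read in as NaN
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=['missing', '-999'])

print(df_csv)
print(df_csv.isnull().sum())   # number of missing values per column
```

Declaring the codes at load time keeps the affected columns numeric, so no later `replace` step is needed.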

In [None]:
# Use NumPy's np.nan to turn the 'NaN' strings into true missing values
df2 = df2.replace('NaN',np.nan)
print(df2.isnull())

In [None]:
# create a new column and sum the number of missing values across the row
df2['Missing_Values'] = df2.isnull().sum(axis=1)
df2

# More Advanced Data Wrangling in Pandas


Pandas provides various facilities for combining Series and DataFrame objects, with set logic for the indexes and relational functionality for join / merge-type operations.

# <font color='red'> Concatenating Dataframes </font>


**Concat**: The concat function in Pandas appends rows or columns from one DataFrame to another. Concatenating DataFrames simply stacks them, either vertically (row on row) or side by side (column next to column).


### <font color='blue'> Concat Example </font> 

In [None]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a

In [None]:
raw_data2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data2, columns = ['subject_id', 'first_name', 'last_name'])
df_b

In [None]:
raw_data3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data3, columns = ['subject_id','test_id'])
df_n

In [None]:
# concatenate 2 dataframes along rows
df_byrow = pd.concat([df_a, df_b])

df_byrow

In [None]:
# join two dataframes along columns
df_bycol = pd.concat([df_a, df_b], axis=1)
df_bycol

In [None]:
# how would we concat the columns of df_n to df_a ?
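
One possible answer, as a sketch: pass `axis=1` again. The frames are rebuilt here so the snippet runs on its own; `df_an` is a name introduced only for this example.

```python
import pandas as pd

df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})
df_n = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

# axis=1 places the frames side by side, aligning rows on the index (0, 1, 2, ...)
df_an = pd.concat([df_a, df_n], axis=1)
print(df_an)
```

Note that concat aligns on the row index, not on `subject_id`: rows 5 to 9 have no partner in `df_a`, so the name columns are NaN there, and `subject_id` appears twice. Matching on the values of `subject_id` is what merge is for.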

## <font color='red'> Merging Dataframes </font>

**Merge**:  Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

<img src='initialiDF.jpg'>

<img src="merge_PD.jpg">

# What's the difference?

1. Concat takes two or more dataframes and combines them along the rows or the columns. More like adding.
2. Merge allows SQL-like joining of two dataframes, matching rows on shared values in a column found in both. Useful for relational-database-style work. More like mixing.
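
The contrast can be sketched with two tiny hypothetical frames (`left`, `right`, and their columns are made up for illustration):

```python
import pandas as pd

# Hypothetical frames that share a 'key' column
left = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['k2', 'k3', 'k4'], 'y': [20, 30, 40]})

# Concat stacks the rows without looking at the values in 'key'
stacked = pd.concat([left, right], ignore_index=True)

# Merge matches rows whose 'key' values agree
matched = pd.merge(left, right, on='key', how='inner')

# An outer merge keeps every key; indicator=True records where each row came from
everything = pd.merge(left, right, on='key', how='outer', indicator=True)

print(stacked)
print(matched)
print(everything)
```

Concat returns six rows with NaN wherever a column is absent from one frame, while the inner merge returns only the two keys present in both frames.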

In [None]:
df_a

In [None]:
df_b

In [None]:
# Merge df_a and df_b on subject_id; how='right' keeps every row of df_b
mergeright=pd.merge(df_a,df_b,on='subject_id',how='right')
mergeright

In [None]:

mergeleft = pd.merge(df_a,df_b,on='subject_id',how='left')
mergeleft

In [None]:
#merge by index example DEMO
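
A possible version of this demo, assuming two hypothetical frames that use `subject_id` as their index (`names` and `tests` are illustrative, not the frames defined above):

```python
import pandas as pd

# Hypothetical frames indexed by subject_id
names = pd.DataFrame({'first_name': ['Alex', 'Amy', 'Allen']},
                     index=pd.Index(['1', '2', '3'], name='subject_id'))
tests = pd.DataFrame({'test_id': [51, 15, 61]},
                     index=pd.Index(['1', '2', '4'], name='subject_id'))

# left_index / right_index tell merge to join on the row labels instead of a column
by_index = pd.merge(names, tests, left_index=True, right_index=True, how='inner')
print(by_index)
```

Only subjects '1' and '2' appear in both indexes, so the inner merge keeps just those two rows.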


# Groupby

1. Groupby splits the data into groups according to a variable of your choice.
2. A GroupBy object's .groups attribute is a dictionary whose keys are the unique group names and whose values are the axis labels belonging to each group.
3. The GroupBy object can be indexed and aggregated much like the Pandas objects we have seen before.
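
These three points can be sketched on a hypothetical miniature of the drinks data (the labels 'A' to 'D' and the serving counts are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the drinks data; rows are labelled 'A'..'D'
mini = pd.DataFrame(
    {'continent': ['Europe', 'Asia', 'Europe', 'Asia'],
     'beer_servings': [100, 50, 200, 70]},
    index=['A', 'B', 'C', 'D'])

grouped = mini.groupby('continent')           # 1. split the rows by continent

print(grouped.groups)                         # 2. group name -> row labels in that group
europe = grouped.get_group('Europe')          # pull out the rows of one group
means = grouped['beer_servings'].mean()       # 3. column access and aggregation, as usual

print(europe)
print(means)
```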

In [None]:
# dataframe that has counts of types of beverage servings 
drinks = pd.read_csv('http://bit.ly/drinksbycountry',index_col='country')

In [None]:
#inspect the column continent, how ??


In [None]:
# using .groupby function to see the beer serving mean by continent
drinks.groupby('continent').beer_servings.mean()

In [None]:
# max number of beer servings by continent
drinks.groupby('continent').beer_servings.max()

In [None]:
#Aggregate findings
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max'])

In [None]:
## Accessing objects: the result has one (column, statistic) pair per column
obj = drinks.groupby('continent').agg(['min','max'])
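
A hedged sketch of how such a result can be accessed, using a hypothetical miniature frame in place of `drinks` (values invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the drinks data
mini = pd.DataFrame({
    'continent': ['Europe', 'Asia', 'Europe', 'Asia'],
    'beer_servings': [100, 50, 200, 70],
    'wine_servings': [30, 5, 90, 15]})

obj = mini.groupby('continent').agg(['min', 'max'])

print(obj.columns)                 # MultiIndex of (column, statistic) pairs
beer = obj['beer_servings']        # sub-frame with the columns 'min' and 'max'
europe_beer_max = obj.loc['Europe', ('beer_servings', 'max')]   # a single value

print(beer)
print(europe_beer_max)
```

Selecting with the outer column name returns a smaller frame, while a (column, statistic) tuple together with a row label returns one cell.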

# MatplotLib 

Matplotlib is a 2D plotting library for Python whose pyplot interface was designed to closely resemble MATLAB:

1. Allows for access to object properties that can be modified

***https://matplotlib.org/users/pyplot_tutorial.html*** <br> </br>
**Check out some examples: https://matplotlib.org/examples/**

*Matplotlib* works using a hierarchy of containers that are all adjustable:
   1. **Figure** is the topmost container: it is the entire page for your plot
      -- it can contain multiple plots, each on its own Axes
   2. Plotting is mostly done on the **Axes** container, created via subplot:
      -- Each Axes object has access to many other plot controls
 
 *So in general we create plots by specifying each container (topmost first) and then drawing the graphs* <br> </br>***(not necessary for basic plots)***
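
A minimal sketch of the "containers first" approach, using `plt.subplots()` to create the Figure and one Axes explicitly. The data, labels, and title are placeholders chosen for this example:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a plain script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 50)

# Topmost container first: a Figure, with a single Axes inside it
fig, ax = plt.subplots(figsize=(5, 3))

# Then draw on the Axes and adjust its properties
line, = ax.plot(x, x**2, label='x squared')
ax.set_title('Explicit Figure and Axes')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
```

Holding on to `fig` and `ax` makes it clear which container each later adjustment applies to.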

<img src='figure_axes_axis_labeled.png'>

In [None]:
# This code will embed our plots inside this notebook 
%matplotlib inline

#load the specific sub library we are interested in
import matplotlib.pyplot as plt

In [None]:
#Basic plot Example 1
#Create values incremented by .25 by using NumPy
x2 = np.arange(0,1.25,.25) 
plt.plot(x2,x2**2)

In [None]:
#Basic plot Example 2 with more than 1 line
#Create some new fake data to graph
X = np.linspace(-np.pi,np.pi,256,endpoint=True)
C = np.cos(X)
S = np.sin(X)

#Create a basic plot
plt.plot(X,S,'r')
#plt.show()

# A second plt.plot call in the same cell draws another line on the same Axes
plt.plot(X,C,'g')

In [None]:
# Another Example of Updating the properties of a line object;
# With new data!
x = np.arange(0,1.0,0.01) # values 0 to 1 in steps of .01
y1 = np.sin(2*np.pi*x)
y2 = np.sin(4*np.pi*x)

# plt.plot accepts several x, y pairs in one call and returns a list of line objects
lines = plt.plot(x, y1, x, y2)

# The setp() function operates on a single instance or a list of instances.
plt.setp(lines,linewidth=2,color='b')

### Adjusting the x and y tick labels
> <div style="text-align: right"> *Use  plt.xticks & plt.yticks*</div>

In [None]:
# Example: replace the tick labels (tick positions and labels must have equal lengths)
x = [1, 2, 3, 4]
y = [1, 4, 9, 6]
z = [1, 8,16,24]

# create a list of labels
xlabs = ['Mark','Himanshu','Samson','Danny']
ylabs = ['Man','Hombre','Bro','Muchacho']
# set the plot up
plt.plot(x,y,'go')
#add the labels
plt.xticks(x,xlabs,rotation='vertical')
# ylab adjusted
plt.yticks(y,ylabs)

## Add sub-plots to a figure

To add a subplot we must pass several arguments to the plt.subplot() call:

using:
<div style="text-align: left">***plt.subplot(nrows, ncols, plot_num)***</div>
<div style="text-align: center">What does that mean??</div>
<div style="text-align: center">
1. Number of rows (e.g. **1**)
2. Number of columns (e.g. **1**) 
3. Plot number (e.g. **1**), counted from 1
</div> 
<br>  </br>
<br>  </br>
<div style="text-align: right">*let's look at this more closely* </div>


<img src='plot_subplot-grid_1.png'>
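
The numbering can be sketched by filling a 2 x 2 grid and writing each plot number inside its own subplot (the grid size is an arbitrary choice for this example):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a plain script; omit inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure()

# Plot numbers start at 1 and run left to right, then top to bottom
for plot_num in range(1, 5):
    ax = plt.subplot(2, 2, plot_num)
    ax.text(0.5, 0.5, 'subplot(2, 2, {})'.format(plot_num),
            ha='center', va='center')

plt.tight_layout()
```

The three-digit form is shorthand for the same call: `plt.subplot(211)` means `plt.subplot(2, 1, 1)`.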

In [2]:
#Adjust the figures and add new plots to a figure

# The first figure
plt.figure(1) 

## REMINDER USAGE: plt.subplot(nrows, ncols, plot_number)

# The first subplot of figure 1
plt.subplot(211)
#Our first plot
plt.plot(x,y)
# The SECOND subplot of figure1
plt.subplot(212) # <---
# Our SECOND plot
plt.plot(x,z,color='r')

# Add a title above the first subplot
plt.figure(1)
plt.subplot(211) #<---
plt.title('WOAH 2 separate graphs!!')

*If this cell raises `NameError: name 'plt' is not defined`, first run the import cell and the cell that defines `x`, `y`, and `z`.*

# Pandas and Matplotlib together

#### Pandas dataframes have access to the .plot function for quick plotting

In [1]:
data = drinks.groupby('continent').mean()
data.plot(kind='barh')

*If this cell raises `NameError: name 'drinks' is not defined`, first run the cell that loads `drinks` with `pd.read_csv`.*
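
Because `.plot` returns a Matplotlib Axes, the plot can be adjusted afterwards with the same Axes methods. A sketch with a hypothetical stand-in for the grouped means (the numbers are invented, not the real drinks averages):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a plain script; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for drinks.groupby('continent').mean()
data = pd.DataFrame(
    {'beer_servings': [150.0, 60.0], 'wine_servings': [60.0, 10.0]},
    index=pd.Index(['Europe', 'Asia'], name='continent'))

ax = data.plot(kind='barh')            # returns the Axes that pandas drew on
ax.set_xlabel('mean servings')         # ordinary Matplotlib calls work on it
ax.set_title('Mean servings by continent')
```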