<a id="pandas_create_dataframes"></a>
## Working with DataFrames in Pandas

Now that we know how to import dataframes, we will cover how to work with them. You can think of a dataframe as an Excel spreadsheet. Each row of a dataframe corresponds to the measurements or values of one instance, while each column is a vector containing data for a specific feature. This means that the values in a row do not need to share a data type: one column can hold floats, another integers, another strings, and so on. This will be the most common data structure that you will work with.

##### Manually Creating a DataFrame
As opposed to reading in a file or connecting to a database, you can also create your own dataframe using [```DataFrame```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [10]:
import numpy as np
import pandas as pd

AgeData = pd.DataFrame({'Name': ['Jeff','Bill','Nancy','Greg','Jill','Anne','Doug'], 
              'Age': [15,88,41,31,78,9,6]}) #Creating dataframe
AgeData

| | Name | Age |
|---|---|---|
| 0 | Jeff | 15 |
| 1 | Bill | 88 |
| 2 | Nancy | 41 |
| 3 | Greg | 31 |
| 4 | Jill | 78 |
| 5 | Anne | 9 |
| 6 | Doug | 6 |
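The point made earlier about data types can be checked on this dataframe. The sketch below rebuilds the same data under a hypothetical variable name (`age_data`, not used in the original) and inspects the type of each column. It is only an illustration of how one row can mix types while each column keeps a single one.

```python
import pandas as pd

# Same data as the AgeData example; the name age_data is illustrative only.
age_data = pd.DataFrame({
    'Name': ['Jeff', 'Bill', 'Nancy', 'Greg', 'Jill', 'Anne', 'Doug'],
    'Age': [15, 88, 41, 31, 78, 9, 6],
})

# Each column has one dtype: a text type for Name, an integer type for Age.
print(age_data.dtypes)

# A single row, by contrast, holds values of different types side by side.
first_row = age_data.iloc[0]
print(first_row['Name'], first_row['Age'])
```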


<a id="pandas_examining_data"></a>
## Examining Data 

Once you have your dataframe, you will want to inspect it. Below are a few ways of getting quick information from your dataframe.
- [```df.head()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) returns the first five rows
- [```df.tail()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) returns the last five rows
- [```df.info()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) prints a summary of your dataframe, including columns, data types, the index range, and memory usage
- [```df.columns```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) returns the columns in the dataframe.
- You can also run various mathematical summaries on your dataframe, such as [```df.mean()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html), [```df.max()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html), [```df.min()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.min.html#pandas.Series.min), [```df.median()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html), [```df.std()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html), [```df.corr()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html), and many more. Each of these returns one value per column, except ```df.corr()```, which returns a table of pairwise correlations between columns.

    

##### Creating a Random DataFrame for Examples

In [21]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,5000,size=(100, 4)), columns=list('ABCD')) # 100x4 dataframe

Use the methods you just learned to explore the example dataframe in the workspace below.

In [32]:
df.median()

A    2332.0
B    2421.0
C    2266.5
D    2166.0
dtype: float64
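The other inspection methods listed earlier can be tried the same way. The sketch below is one possible walk-through. It assumes a seeded NumPy generator (`default_rng(0)`, an arbitrary choice that is not part of the original cell) so the numbers are reproducible, and it collects a few statistics into a hypothetical `summary` table.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same data (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5000, size=(100, 4)), columns=list('ABCD'))

print(df.head())         # first five rows
print(df.tail(3))        # last three rows; head and tail both accept a row count
df.info()                # prints its summary directly and returns None
print(list(df.columns))  # ['A', 'B', 'C', 'D']

# Column-wise statistics: each call returns one value per column.
summary = pd.DataFrame({
    'mean': df.mean(),
    'min': df.min(),
    'max': df.max(),
    'std': df.std(),
})
print(summary)

# Pairwise correlations between the four columns.
print(df.corr())
```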

<a id="pandas_selecting_data"></a>
## Selecting Data

Pandas dataframes make it easy to slice and dice our data in any way we would like.
- Selecting one column is as easy as putting the column you want in brackets after the dataframe name: ```df['Col1']```.
- To select more than one column, put the columns you want into a list, then place that list in brackets after the dataframe name: ```df[['Col1','Col2','Col3']]```.
- You can also select data by its position using [```iloc```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) or by its label using [```loc```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc).
    - ```df.loc['USA']``` returns the observations whose index label is USA
    - ```df.loc[['USA','UK']]``` returns the observations labeled USA and UK
    - ```df.iloc[4]``` returns the observation at integer position 4 (the fifth row, since positions start at 0)
    - ```df.iloc[[4,10]]``` returns the observations at integer positions 4 and 10
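The selection patterns above can be sketched on a small dataframe. The data, the country index, and the variable names below are hypothetical and chosen only so that both label-based and position-based lookups have something to find.

```python
import pandas as pd

# Hypothetical data, indexed by country label.
df = pd.DataFrame(
    {'Col1': [1, 2, 3, 4], 'Col2': [10, 20, 30, 40], 'Col3': [0.1, 0.2, 0.3, 0.4]},
    index=['USA', 'UK', 'France', 'Japan'],
)

one_col = df['Col1']              # a single column comes back as a Series
two_cols = df[['Col1', 'Col3']]   # a list of columns comes back as a DataFrame

usa = df.loc['USA']               # one row by label, returned as a Series
usa_uk = df.loc[['USA', 'UK']]    # several rows by label, returned as a DataFrame

third_row = df.iloc[2]            # positions start at 0, so this is the France row
first_and_last = df.iloc[[0, 3]]  # rows at positions 0 and 3

# loc and iloc also accept a column selector after the comma.
uk_col2 = df.loc['UK', 'Col2']
top_left = df.iloc[0, 0]
```

Note that `loc` and `iloc` use square brackets rather than parentheses, because they are indexers and not ordinary methods.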

<a id="pandas_data_cleaning"></a>
## Data Cleaning 

Pandas also does a great job when it comes to cleaning your dataframe.
<a id="pandas_data_cleaning_missing_values"></a>
- __Missing Values:__
    - One of the first steps you want to take when importing a dataframe is to check for null values. This can be done by calling ```df.isnull()```, which returns a Boolean dataframe of the same shape where True marks a null value. If you want the total number of null values per column, place the ```sum()``` method at the end of the statement: [```df.isnull().sum()```](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html) (you can perform math on Booleans in Python; True is treated as 1 and False as 0).
    - You can drop missing values using [```df.dropna()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). The argument ```axis=0``` drops every row that contains a missing value and ```axis=1``` drops every column that contains one.
    - You can impute the missing values using [```df.fillna(x)```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html), where ```x``` can be any value. For example, ```df['Col1'].fillna(df['Col1'].mean())``` replaces the missing values in ```Col1``` with the column mean. ```fillna``` returns a new object, so assign the result back to ```df['Col1']``` if you want to keep it.
<a id="pandas_data_cleaning_replacing_values"></a>    
- __Replacing Values:__
    - Sometimes you want to replace a value in a dataframe. To do this in pandas, use the [```replace```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method. ```df.replace('Bos','Boston')``` would replace Bos with Boston. This also works with multiple values if you pass lists: ```df.replace(['Bos','NYC'],['Boston','New York City'])```.
<a id="pandas_data_cleaning_renaming_lists"></a>    
- __Renaming Columns:__
    - Data will often come with missing, incorrect, or overly long column names. You can rename columns by assigning a new list to [```df.columns```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html), but the list must be the same length as the current column list. The best way to rename columns is to use the [```rename```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method. ```rename``` takes a dictionary with the old column name as the key and the new column name as the value, passed through the ```columns``` argument: ```df.rename(columns={'Col1' : 'Column_1', 'Col2' : 'Column_2'})```. Without ```columns=```, the dictionary is applied to the row labels instead.
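The cleaning steps in this section can be sketched together on a small dataframe. The data and the result names below (`null_counts`, `complete_rows`, and so on) are hypothetical. Each call returns a new object rather than modifying `df`, which is why every result is assigned to its own name.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps and abbreviations.
df = pd.DataFrame({
    'Col1': [1.0, np.nan, 3.0, np.nan],
    'Col2': ['Bos', 'NYC', 'Bos', None],
})

# Missing values
null_counts = df.isnull().sum()                # number of nulls in each column
complete_rows = df.dropna(axis=0)              # keep only rows without any null
filled = df['Col1'].fillna(df['Col1'].mean())  # the mean of the non-null values is 2.0

# Replacing values: the two lists are matched element by element.
expanded = df.replace(['Bos', 'NYC'], ['Boston', 'New York City'])

# Renaming columns: the mapping goes through the columns keyword.
renamed = df.rename(columns={'Col1': 'Column_1', 'Col2': 'Column_2'})

print(null_counts)
print(renamed.columns.tolist())  # ['Column_1', 'Column_2']
```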

<a id="pandas_joining_data"></a>
## Joining Data

There are three main methods for joining data in pandas: [```append```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html), [```concat```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html), and [```join```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html). Note that ```append``` was deprecated in pandas 1.4 and removed in pandas 2.0; use ```concat``` in its place.
   - ```df1.append(df2)``` stacks df1 on top of df2 (pandas versions before 2.0 only). ```pd.concat([df1, df2])``` does the same in any version. Make sure the columns are identical.
   - ```pd.concat([df1, df2], axis=1)``` adds the columns of df2 to the end of df1. ```concat``` is a pandas function that takes a list of dataframes, not a dataframe method. Rows are matched on their index, so make sure the indexes are identical.
   - ```df1.join(df2, on='Col1', how='inner')``` joins the columns of df1 with the columns of df2 where the values in df1's ```Col1``` match df2's index. ```how``` accepts left, right, outer, and inner joins. You can also pass a list to ```on``` to match on several columns, in which case df2 needs a matching multi-level index.
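The joining options above can be sketched as follows. The two frames and the result names are hypothetical. The last line uses `merge`, a related dataframe method that this section does not cover. It is included only as an assumption about what is often more convenient when the key is an ordinary column in both frames.

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'X': [1, 2, 3]})
df2 = pd.DataFrame({'Col1': ['b', 'c', 'd'], 'Y': [20, 30, 40]})

# Stack rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df1, df1], ignore_index=True)

# Place frames side by side; rows are aligned on the index.
side_by_side = pd.concat([df1[['X']], df2[['Y']]], axis=1)

# join matches df1's Col1 against the index of the other frame,
# so Col1 is moved into df2's index first.
joined = df1.join(df2.set_index('Col1'), on='Col1', how='inner')

# merge (not covered above) matches a column against a column directly.
merged = df1.merge(df2, on='Col1', how='inner')

print(joined)
print(merged)
```

Both `joined` and `merged` keep only the keys present in both frames (`b` and `c`), because `how='inner'` discards unmatched rows.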