<img src = "tewnumpypandas.png" height="550px" width="550px"/>

## 4. Concatenate Data and Transform Data in Python
Concatenating and transforming data are core steps in data analysis: they get your data into the structure and order that you need.<br>
Concatenation is combining data from separate sources, while transformation is converting and reformatting data into the format your analysis requires.<br>
In this tutorial, we are going to learn how to...<br>
 - Drop Data
 - Add Data &
 - Sort Data

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
#Our first df
df = DataFrame(np.arange(36).reshape(6,6))
df

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35


In [3]:
#Our second df
df02 = DataFrame(np.arange(15).reshape(5,3))
df02

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


#### Concatenating Data
To concatenate data you use the concat() function. It joins data from separate sources and combines them into one df. If you want to join df's side by side, matching them up by their rows, then you need to pass the *<font color = blue>axis = 1</font>* argument to concat(). This tells pandas to align the df's on their row index values and place their columns next to each other.

In [4]:
pd.concat([df, df02], axis = 1)

Unnamed: 0,0,1,2,3,4,5,0.1,1.1,2.1
0,0,1,2,3,4,5,0.0,1.0,2.0
1,6,7,8,9,10,11,3.0,4.0,5.0
2,12,13,14,15,16,17,6.0,7.0,8.0
3,18,19,20,21,22,23,9.0,10.0,11.0
4,24,25,26,27,28,29,12.0,13.0,14.0
5,30,31,32,33,34,35,,,


To stack our df's on top of each other, matching them up by their columns, we leave out the *<font color = blue>axis = 1</font>* argument, as stacking rows (axis = 0) is pandas' default behaviour.

In [5]:
pd.concat([df, df02])

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3.0,4.0,5.0
1,6,7,8,9.0,10.0,11.0
2,12,13,14,15.0,16.0,17.0
3,18,19,20,21.0,22.0,23.0
4,24,25,26,27.0,28.0,29.0
5,30,31,32,33.0,34.0,35.0
0,0,1,2,,,
1,3,4,5,,,
2,6,7,8,,,
3,9,10,11,,,


We can see that the second df has simply been added to the bottom of the first. Because the two df's are of different sizes, the cells with no data are filled with NaN, which is also why those columns are now displayed as floats.
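
To make the difference between the two directions concrete, here is a minimal sketch that rebuilds the same two df's and compares the shapes and NaN counts of each result. The variable names `side_by_side` and `stacked` are just illustrative and do not appear in the cells above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))
df02 = pd.DataFrame(np.arange(15).reshape(5, 3))

# axis=1: place the frames side by side, aligning rows on the index labels
side_by_side = pd.concat([df, df02], axis=1)

# default axis=0: stack the frames vertically, aligning on the column labels
stacked = pd.concat([df, df02])

print(side_by_side.shape)  # (6, 9)
print(stacked.shape)       # (11, 6)

# Count the NaN fill values created by the size mismatch
print(int(side_by_side.isna().sum().sum()))  # 3
print(int(stacked.isna().sum().sum()))       # 15
```

With axis=1 only the last row of df02 is missing, so three cells are filled; with the default, all five df02 rows lack columns 3 to 5, so fifteen cells are filled.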
#### Transforming the data
**Dropping Data**<br>
Because the df's being concatenated often differ in size or format, you usually need to tidy up the resulting df, typically by dropping rows that have no data. Before we do anything else, we are going to drop the rows with the index labels 0 and 2 from one of the original df's...

In [6]:
df.drop([0,2])

Unnamed: 0,0,1,2,3,4,5
1,6,7,8,9,10,11
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35


The rows with the index labels 0 & 2 have been dropped in the result. Note that drop() returns a new df and leaves the original df unchanged. To drop columns, instead of rows, simply pass in the *<font color = blue>axis = 1</font>* argument...

In [7]:
df.drop([0,2], axis = 1) # Drop the columns labelled 0 & 2 rather than the rows

Unnamed: 0,1,3,4,5
0,1,3,4,5
1,7,9,10,11
2,13,15,16,17
3,19,21,22,23
4,25,27,28,29
5,31,33,34,35
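
The text above mentions dropping rows that have no data, while the cells drop rows by label. As a hedged sketch of the former, the example below uses dropna(), which is not used in the original cells, on the side-by-side result. The names `combined`, `complete_rows` and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))
df02 = pd.DataFrame(np.arange(15).reshape(5, 3))

combined = pd.concat([df, df02], axis=1)

# Drop every row that contains at least one NaN (only row 5 here)
complete_rows = combined.dropna()

# drop() returns a new object; the original df is left unchanged
trimmed = df.drop([0, 2])

print(complete_rows.shape)   # (5, 9)
print(list(trimmed.index))   # [1, 3, 4, 5]
print(df.shape)              # (6, 6)
```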


**Adding Data**<br>
First we will create a Series object and then add it to a df...

In [12]:
series = Series(np.arange(6))
series.name = 'adding' # Name the Series; the name becomes the column label when it is joined to a df
series

0    0
1    1
2    2
3    3
4    4
5    5
Name: adding, dtype: int32

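A good way to add data to a df is to join it with another df or a named Series by calling the join() method. Note that the cell below does not call join(); it passes the Series to the DataFrame constructor instead, so the result is not what we were after...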

In [13]:
added = DataFrame(df, series)
added

Unnamed: 0_level_0,0,1,2,3,4,5
adding,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35


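No column has actually been added to our original df. The second positional argument of the DataFrame constructor is the index, so the Series has been used as the row index (labelled adding) rather than being added as a column, which is not what was supposed to happen.<br><br>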
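
To get the Series in as a column rather than as the index, one option is to call join() on the df itself. This is a minimal sketch that rebuilds the same df and Series; join() aligns the two on their index and uses the Series name as the new column label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))
series = pd.Series(np.arange(6), name='adding')

# join() aligns on the index and uses the Series name as the new column label
added = df.join(series)

print(added.columns.tolist())  # [0, 1, 2, 3, 4, 5, 'adding']
print(added.shape)             # (6, 7)
```

The Series must have a name for this to work, which is why it was named before joining.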
**Append Method**<br>
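Another way to add data to a df is to use the append() method. This adds rows to the bottom of your current df. Note that append() has been removed from recent pandas releases (2.0 onwards), so this cell needs an older version...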

In [16]:
added_df = added.append(added, ignore_index = False) # ignore_index = False keeps the original index labels
added_df

Unnamed: 0_level_0,0,1,2,3,4,5
adding,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23


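The adding label still shows up on the index rather than as the last column, because that is how the added df was built. Otherwise the result is as expected: the rows of added are repeated beneath the original ones, with their original index labels.<br><br>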
**Re-Indexing your df**<br>
After stacking df's it is usually a good idea to re-index, so that every row has a unique label. Passing ignore_index = True does this for us, so we are going to do that now and see what results we get...

In [17]:
added_df = added.append(added, ignore_index = True) # ignore_index = True discards the old labels and numbers the rows from 0
added_df

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35
6,0,1,2,3,4,5
7,6,7,8,9,10,11
8,12,13,14,15,16,17
9,18,19,20,21,22,23


The re-indexing means that our index now has all unique values. The adding label has disappeared because ignore_index = True throws away the old index, including its name, and replaces it with a fresh one.
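
On pandas versions where append() is not available, the same stacking can be done with pd.concat(). The sketch below is an assumption about how you might rewrite the two cells above, not the original code; it stacks the plain df on itself, and the names `kept` and `renumbered` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))

# Keep the original labels: each of 0..5 appears twice
kept = pd.concat([df, df])

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([df, df], ignore_index=True)

print(kept.index.is_unique)        # False
print(renumbered.index.is_unique)  # True
print(renumbered.shape)            # (12, 6)
```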

#### Sorting Data
To sort our data we use the sort_values() method. With this method, you always pass in the by argument, as this tells pandas which column (or columns) you want the df to be sorted by...

In [18]:
#To get pandas to sort the df by column 5 & in descending order
sorted_df = df.sort_values(by = [5], ascending = [False]) # named sorted_df to avoid shadowing the built-in sorted()
sorted_df

Unnamed: 0,0,1,2,3,4,5
5,30,31,32,33,34,35
4,24,25,26,27,28,29
3,18,19,20,21,22,23
2,12,13,14,15,16,17
1,6,7,8,9,10,11
0,0,1,2,3,4,5


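Note that even though we sorted by column 5 only, every column now appears in descending order. That is because sort_values() moves whole rows, and in this df every column increases from one row to the next, so reversing the row order reverses all of them.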
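
To see that only the chosen column is guaranteed to be ordered, here is a small sketch with made-up data in which the columns do not rise and fall together. The df `scores` and its column names are hypothetical and not part of the tutorial data.

```python
import pandas as pd

# Hypothetical data where the columns do not rise and fall together
scores = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                       'x': [3, 1, 4, 2],
                       'y': [10, 40, 20, 30]})

# Sort by y only; each row travels as a unit
by_y = scores.sort_values(by='y', ascending=False)

print(by_y['y'].tolist())    # [40, 30, 20, 10]
print(by_y['x'].tolist())    # [1, 2, 4, 3]
print(by_y.index.tolist())   # [1, 3, 2, 0]
```

Column y is in descending order, while x simply follows its rows and ends up in no particular order.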