# Import Packages
First, we import the pandas package and the datasets module from the sklearn package, which provides the dataset for this tutorial.

In [22]:
import pandas as pd
from sklearn import datasets

# Load Dataset
Load the iris dataset from the datasets module imported above. Iris is a famous dataset in statistics and has been used in numerous tutorials.

If you run ```print(iris)``` you will see that it returns a dictionary-like object. The "data" key holds a 150 x 4 numpy array, the "target" key holds a numpy array of integer class labels, and the "feature_names" key holds a list of column names. The object contains other keys as well, but we shall use these 3 components to build the pandas dataframe used in the rest of this tutorial.

In [23]:
# import some data to play with
iris = datasets.load_iris()
#print(iris)
print("Iris data shape :", iris['data'].shape)
print('\n')

Iris data shape : (150, 4)
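
As a quick check of the structure described above, the following sketch lists the keys of the loaded object and inspects the three components used in this tutorial. The exact set of keys can vary between sklearn versions, so the key list is printed rather than assumed.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object behaves like a dictionary; list what it contains
print(sorted(iris.keys()))

# The three components used in this tutorial
print(type(iris["data"]), iris["data"].shape)      # 2-D numpy array of measurements
print(type(iris["target"]), iris["target"].shape)  # 1-D numpy array of class labels
print(iris["feature_names"])                       # list of column names
```

Items can be accessed either as keys (```iris['data']```) or as attributes (```iris.data```); both styles appear in this tutorial.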




# Create DataFrame
Dataframes are among the most useful data structures in Python. They are 2-dimensional tabular data structures with column names, row names (the index) and data.
We use ```pd.DataFrame()``` to convert the numpy array into a pandas dataframe. This constructor is one of several ways to create a dataframe; reading from a file, shown later, is another. We can check the columns created with the ```.columns``` attribute.

In [24]:
# Create data frame
df=pd.DataFrame(iris.data,columns=iris.feature_names)
print(df.columns)


Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
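
The same pattern can be tried on a tiny array to see the pieces more clearly. This is a minimal sketch: the array values and the column names "a" and "b" are made up for illustration. Note that ```.columns``` and ```.shape``` are attributes, so they are written without parentheses.

```python
import numpy as np
import pandas as pd

# Made-up 3 x 2 array, purely for illustration
values = np.arange(6).reshape(3, 2)

# Column names are supplied separately, as with the iris feature names
small = pd.DataFrame(values, columns=["a", "b"])

print(small.shape)             # (3, 2)
print(small.columns.tolist())  # ['a', 'b']
print(small.index.tolist())    # [0, 1, 2], the default row labels
```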


# Add new column
Next we see how a new column "target" can be added to an existing dataframe. A new column can be created by assigning a list or array, of the same length as the dataframe, to a new column name. There are several other ways, but this is the most common one. Learn about other ways in this excellent article - https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/

In [25]:
# Let's add another column to the dataset
df['target']=iris['target']
print("Checking top records: ")
print(df.head())
print('\n')

Checking top records: 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
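
Two of the other ways to add a column are sketched below on a small made-up dataframe (the column names and values are illustrative only). ```assign()``` returns a new dataframe and leaves the original untouched, while ```insert()``` modifies the dataframe in place and lets you choose the column position.

```python
import pandas as pd

# Small made-up dataframe, purely for illustration
frame = pd.DataFrame({"length": [5.1, 4.9]})

# assign() returns a new dataframe with the extra column
with_target = frame.assign(target=[0, 0])
print(with_target.columns.tolist())  # ['length', 'target']
print(frame.columns.tolist())        # ['length'], the original is unchanged

# insert() changes the dataframe in place; 0 is the position of the new column
frame.insert(0, "row_id", [10, 11])
print(frame.columns.tolist())        # ['row_id', 'length']
```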




# Recode column
Then we create yet another new column "Species" by mapping values from column "target" to the respective species names, using the dictionary "species_mapping" with the ```map()``` function. This is also known as recoding a column. Another way to do this is ```.replace()```, which gives the same result here because every value in "target" has an entry in the dictionary. We create another version of "Species", "Species_2", by using ```.replace()```.


In [26]:

# Use map to recode column
species_mapping={0:iris['target_names'][0],1:iris['target_names'][1],2:iris['target_names'][2]}
print('Checking species mapping: ')
print(species_mapping)
print('\n')

df['Species'] = df['target'].map(species_mapping)
print("Checking if Species is created : ")
print(df.head())
print('\n')

# Use .replace to recode column
df['Species_2'] = df['target'].replace(species_mapping)
print("Checking if Species_2 is created : ")
print(df.head())
print('\n')



Checking species mapping: 
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}


Checking if Species is created : 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target Species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  


Checking if Species_2 is created : 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7

# Drop columns
Columns can be dropped with the ```.drop()``` function, where the option ```axis=1``` specifies that we want to drop a column rather than a row. Checking the column names again shows that "Species_2" has been dropped. We can use the ```tolist()``` function to convert the column names, which are stored in a pandas Index object, to a list. This is not necessary but is nicer to look at.

In [27]:
df.drop(['Species_2'],axis=1,inplace=True)
print("Checking if Species_2 is dropped : ")
print(df.columns.tolist())
print('\n')

Checking if Species_2 is dropped : 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target', 'Species']
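
As an alternative to ```axis=1``` with ```inplace=True```, ```.drop()``` also accepts a ```columns``` argument and, by default, returns a new dataframe. A minimal sketch on a made-up dataframe (names and values are illustrative only):

```python
import pandas as pd

# Small made-up dataframe, purely for illustration
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# columns= names the columns to drop, so axis=1 is not needed
trimmed = frame.drop(columns=["c"])

print(trimmed.columns.tolist())  # ['a', 'b']
print(frame.columns.tolist())    # ['a', 'b', 'c'], the original is unchanged
```

Keeping the original dataframe intact can make a notebook easier to re-run, because a cell that drops a column in place fails the second time it is executed.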




# Read Dataset

Here we have loaded the iris dataset from the datasets module of the sklearn package so that anyone can run the script. In an actual work scenario, however, you would need to read in the dataset either from a server or from a local folder. Keeping that in mind, let's see how it would work if you had the same file stored in a local folder. For this we shall first write out the dataframe created above and then read it back. This way it's reproducible for everyone.

Notice that when we write out the dataframe with ```.to_csv()``` we specify the option ```index=False```. If this is not done, the row index is written as an extra unnamed column in the file. You can try writing the file without this option.
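
The effect of leaving out ```index=False``` can be seen without touching the disk. This sketch writes a made-up one-column dataframe to an in-memory text buffer (```io.StringIO``` stands in for a file here) and reads it back.

```python
import io

import pandas as pd

# Small made-up dataframe, purely for illustration
frame = pd.DataFrame({"x": [1, 2]})

# Write without index=False, then read the text back
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# The row index comes back as an extra column
print(round_trip.columns.tolist())  # ['Unnamed: 0', 'x']
```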

The most common way to read files with pandas is the ```read_csv()``` function. This function has many parameters that control how the data is read. For example, by default the first row of the data is treated as the header and used to create column names. If the file does not have a header row, specify ```header=None```; unless we also specify column names with the option ```names```, the dataframe columns will be numbered 0, 1, 2 and so on. Take a look at the pandas documentation to learn more about the available options when reading in data -
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


In [28]:
df.to_csv('Iris.csv',index=False)
iris_df = pd.read_csv('Iris.csv')
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target Species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  
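
The ```header``` and ```names``` options of ```read_csv()``` can be sketched on a file without a header row. The two data rows and the column names below are made up for illustration, and ```io.StringIO``` stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV text with no header row
raw = "5.1,3.5,setosa\n4.9,3.0,setosa\n"

# header=None alone: columns are numbered
numbered = pd.read_csv(io.StringIO(raw), header=None)
print(numbered.columns.tolist())  # [0, 1, 2]

# header=None plus names: columns get the names we supply
named = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["sepal_length", "sepal_width", "species"],
)
print(named.columns.tolist())  # ['sepal_length', 'sepal_width', 'species']
print(len(named))              # 2, both rows are kept as data
```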


# Create DataFrame from scratch

Though not needed that often, it is useful to know how to create a pandas dataframe from scratch. We can create a dataframe from a dictionary of lists, where the dictionary keys become the dataframe column names and the lists become the column values. There are options to specify datatypes and index values. Take a look at the pandas documentation to learn more - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html



In [29]:
# create dictionary
dat_dict = {'Name':['El','Mike','Dustin','Lucas','Max','Will'],'Gender':['Girl','Boy','Boy','Boy','Girl','Boy']}
# convert to dataframe
dat_df = pd.DataFrame(dat_dict)
# check output
print(dat_df.head())


     Name Gender
0      El   Girl
1    Mike    Boy
2  Dustin    Boy
3   Lucas    Boy
4     Max   Girl
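
The ```index``` and ```dtype``` options mentioned above can be sketched as follows. The column name, row labels and values are made up for illustration.

```python
import pandas as pd

# index= sets the row labels; dtype= forces the type of the values
scores = pd.DataFrame(
    {"score": [1, 2, 3]},
    index=["a", "b", "c"],
    dtype="float64",
)

print(scores.index.tolist())      # ['a', 'b', 'c']
print(scores["score"].dtype)      # float64
print(scores.loc["b", "score"])   # 2.0, looked up by row label
```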


# Inspecting a DataFrame

Now that we have learnt how to create a pandas dataframe, be it from an existing loaded dataset, by reading in an external file or from scratch, let's inspect some properties of a dataframe. The ```type()``` function returns the class of "dat_df" as a pandas dataframe and that of column "Name" as a pandas series. We have not introduced series so far. Series are another type of data structure in pandas: they are one-dimensional arrays with labels. Each column of a dataframe is a series, as you can see from running ```print(type(dat_df['Name']))```. You can create a series from a numpy array in the same way as a dataframe, by using ```pd.Series()``` in place of ```pd.DataFrame()```. Finally, we can insert a series as a new column into a dataframe.

In [30]:
# Check class type
print('Dataframe class :', type(dat_df))
print('Dataframe column class :', type(dat_df['Name']))

# import numpy package
import numpy as np
# create numpy array
Sibling = np.array([None,'Nancy',None,'Erika','Billy','Jonathan'])
# convert to pandas series
Sibling = pd.Series(Sibling)
print('Series :',Sibling)

# insert series as new column into dataframe
dat_df['Sibling']=Sibling
print("New dataframe :")
print(dat_df.head())

Dataframe class : <class 'pandas.core.frame.DataFrame'>
Dataframe column class : <class 'pandas.core.series.Series'>
Series : 0        None
1       Nancy
2        None
3       Erika
4       Billy
5    Jonathan
dtype: object
New dataframe :
     Name Gender Sibling
0      El   Girl    None
1    Mike    Boy   Nancy
2  Dustin    Boy    None
3   Lucas    Boy   Erika
4     Max   Girl   Billy
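
Because a series carries labels, inserting it as a column matches values by label rather than by position. The sketch below illustrates this with a made-up "Age" series whose labels are out of order and incomplete; the ages are hypothetical values, not part of the tutorial data.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["El", "Mike", "Dustin"]})

# Hypothetical ages, labelled 2 and 0; there is no value for label 1
ages = pd.Series([12, 13], index=[2, 0])

# Values are matched to rows by label; rows without a match get NaN
people["Age"] = ages

print(people.loc[0, "Age"])           # 13.0
print(people.loc[2, "Age"])           # 12.0
print(pd.isna(people.loc[1, "Age"]))  # True
```

In the cell above, "Sibling" has the default labels 0 to 5, which match the default row labels of "dat_df", so the values line up row by row.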
