### Creating a Pandas Dataframe

* [From a list of tuples](#create-first)
* [From a dictionary](#create-second)
* [Loading a CSV file](#create-third)
* [Built-in datasets](#create-fourth)


In [None]:
# This is the customary way of importing pandas
import pandas as pd


#### (a) From a list of tuples <a class="anchor" id="create-first"></a>

In [None]:
name = ['Bob','Jessica','Mary','John','Mel']
age = [16, 35, 77, 57, 23]

people = list(zip(name, age))
people

In [None]:
df = pd.DataFrame(data=people, columns=['Name','Age'])

In [None]:
df

#### (b) From a dictionary <a class="anchor" id="create-second"></a>

In [None]:
population_dict = { 'Country': [ 'China', 'India', 'United States', 'Indonesia' ],
                    'Population' : [1415045928, 1354051854, 326766748, 266794980] }

for k,v in population_dict.items():
    print (k,v)

In [None]:
df = pd.DataFrame(population_dict)

In [None]:
df.head()

#### (c) From a CSV file <a class="anchor" id="create-third"></a>

In [None]:
# The option sep="," is used to indicate field separators
df = pd.read_csv('misc/population.csv',sep=",") # The file name can be replaced with a URL
df.head()
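
The effect of `sep` is easiest to see on a file that does not use commas. The following is a minimal sketch, assuming a small made-up table held in a string instead of a file on disk; the country names and values are invented for illustration and are not taken from `misc/population.csv`.

```python
import io
import pandas as pd

# Toy CSV text standing in for a file on disk (names and values are made up)
csv_text = """Country Name;Year;Value
Aland;2010;100
Aland;2011;110
Borduria;2010;200
"""

# sep tells read_csv which character separates the fields; the default is ","
toy = pd.read_csv(io.StringIO(csv_text), sep=";")

print(toy.shape)
print(toy.dtypes)
```

Without `sep=";"` each line would be read as a single field, giving a table with one column.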

#### (d) Built-in datasets <a class="anchor" id="create-fourth"></a>

**In-class exercise** Packages like sklearn and seaborn come with practice datasets. Create a pandas dataframe from the iris dataset below.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

In [None]:
iris = load_iris()

In [None]:
iris.feature_names

In [None]:
type(iris.data)

In [None]:
# Create a pandas dataframe with iris data and with column names as above.




### Meta-data about the dataframe

In [None]:
# The number of rows and columns
df.shape

In [None]:
df.columns

In [None]:
# This command is useful to see the datatypes of the columns and to check whether there are any NULL objects.
# This dataframe has none.
df.info()

In [None]:
# Summary of numerical columns
df.describe()

In [None]:
# How are the rows indexed? By default, pandas enumerates the rows when a csv file is loaded.
df.index
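
To see how these commands behave when values are missing, here is a small sketch on a made-up dataframe (the names `toy` and `missing_per_column` are ours, chosen for illustration). It counts missing values per column directly, which is the same information `info()` reports as non-null counts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Score': [1.5, np.nan, 3.0],   # one missing value on purpose
})

n_rows, n_cols = toy.shape               # shape is (rows, columns)
missing_per_column = toy.isna().sum()    # number of NaN entries in each column

print(n_rows, n_cols)
print(missing_per_column)
```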

### Accessing rows and columns


In [None]:
# See the first few rows
df.head(3)

# Similar commands: 
# df.tail()
# df.sample()

**In-class exercise.** Randomly sample 20 distinct countries from the list of countries in the table.

In [None]:
# Select columns by their names
df[ ['Country Name', 'Value'] ].head() # Notice the double [[]].

In [None]:
# Get rows by their indices. This is similar to slicing lists. iloc means "integer location"

df.iloc[0:4]


In [None]:
# The general form of iloc is df.iloc[ row_indexer , col_indexer].

df.iloc[ 0:4, [1,2] ]

In [None]:
# iloc does not support label-based access. To select by label we must use loc instead.
# df.iloc[ 0:1, ['Year','Value']  ] <-----Invalid
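
There are two ways around this restriction, sketched below on a made-up table: select by label with `loc`, or translate the column labels into integer positions first and keep using `iloc`. The dataframe `toy` and its values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'Year': [2010, 2011, 2012], 'Value': [10, 20, 30]})

# Option 1: loc accepts column labels. With the default index the row labels
# are 0, 1, 2, and a label slice includes its end point, so 0:1 gives two rows.
by_label = toy.loc[0:1, ['Year', 'Value']]

# Option 2: look up the integer positions of the labels, then use iloc.
# An iloc slice excludes its end point, so 0:2 gives the same two rows.
col_positions = toy.columns.get_indexer(['Year', 'Value'])
by_position = toy.iloc[0:2, col_positions]

print(by_label)
print(by_position)
```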


#### Set the index

In [None]:
df.set_index(keys=['Country Name'],inplace=True)

In [None]:
df.index

In [None]:
# loc is used for label-based access. Format: df.loc[ labeled_row_indexer, labeled_col_indexer]
# Both the row indexer and the column indexer must be labels, not integer positions.
# Unlike iloc, a label slice includes its end point.

df.loc[ 'India':'Indonesia',  ['Value'] ].sample(5)
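
The following sketch shows label-based access on a small made-up table with one row per country (the values are invented for illustration). It assumes a unique index, which keeps label slices unambiguous.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Value': [1, 2, 3, 4]},
    index=['China', 'India', 'Indonesia', 'Japan'],
)
toy.index.name = 'Country Name'

# A label slice keeps both end points: India and Indonesia
subset = toy.loc['India':'Indonesia', ['Value']]

# A single row label and a single column label return one value
one_value = toy.loc['India', 'Value']

print(subset)
print(one_value)
```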


*Remark* Older code may access elements using df.ix. We do not discuss it because it was deprecated in pandas 0.20 and removed in pandas 1.0.

#### Boolean indexing

In [None]:
df = pd.read_csv('misc/population.csv')

In [None]:
select_condition_1 = df['Value']>1000000000
# select_condition_1 is now a boolean mask

In [None]:
df[select_condition_1].sample(3)

In [None]:
selection_condition_2 = df['Country Name'] == 'China'

In [None]:
# We select rows that satisfy both conditions.
# A boolean index can combine conditions with the
# logical operators AND (&), OR (|), and NOT (~).

df[select_condition_1 & selection_condition_2 ].sample(5)


In [None]:
# isin is a useful operator when building a boolean index
selection_condition_3 = df['Country Name'].isin( ['China','India'] )
df[selection_condition_3].sample(5)
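
When conditions are written inline instead of being stored in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `==`. A minimal sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'Country Name': ['China', 'India', 'Fiji', 'China'],
    'Value': [5, 7, 1, 9],
})

# AND: both conditions must hold
both = toy[(toy['Value'] > 4) & (toy['Country Name'] == 'China')]

# OR: at least one condition must hold
either = toy[toy['Country Name'].isin(['India', 'Fiji']) | (toy['Value'] > 8)]

# NOT: invert a mask
neither = toy[~toy['Country Name'].isin(['China', 'India'])]

print(both)
print(either)
print(neither)
```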

In [None]:
# where is useful when we want to retain the shape of the original table.
# The values that don't match the selection criteria are set to NaN
df.where(df['Year']>2011).shape
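
The difference between plain boolean indexing and `where` can be checked on a small made-up table (the values below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Year': [2010, 2012, 2014], 'Value': [10, 20, 30]})
mask = toy['Year'] > 2011

filtered = toy[mask]            # drops the rows that fail the condition
same_shape = toy.where(mask)    # keeps every row; failing rows become NaN

print(filtered.shape, same_shape.shape)
print(same_shape)
```

Chaining `dropna()` onto the result of `where` removes the NaN rows again.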


**In-class exercise**. Retrieve all the countries whose GDP is more than three standard deviations from the average.

### Modifying data

* [Adding rows](#modify-first)
* [Adding columns](#modify-second)
* [Sorting](#modify-third)

#### Adding rows <a class="anchor" id="modify-first"></a>

In [None]:
# We use a dataset containing two sets of marks for a few students

import pandas as pd

df = pd.read_csv('misc/studentmarks2.csv', sep=",", header=None)

In [None]:
df.columns = ['Name', 'Marks1', 'Marks2']

In [None]:
df

In [None]:
df.columns

In [None]:
df.info()

In [None]:
# Add two students to the existing data. We first make a dataframe out of the new data and then
# concatenate it with the old dataframe. (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)

new_data = pd.DataFrame( [ ['ram',30,40],['sana',25,60] ], columns=['Name','Marks1','Marks2'] )
pd.concat([df, new_data])


In [None]:
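# Notice that, in the result above, the new students share the index labels 0 and 1 with existing students.
# There are two ways to avoid this: either use the ignore_index=True option when adding the rows,
# or reset the index of the result, as shown below.
# df itself was not modified, so this still shows the original second row.
df.iloc[1]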

In [None]:
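# Option 1: reset the index of the combined result; drop=True discards the old labels.
# pd.concat([df, new_data]).reset_index(drop=True)

# Option 2: add the rows the right way by choosing to ignore the index.
# Neither option changes df; assign the result to a variable to keep it.

pd.concat([df, new_data], ignore_index=True)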
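
The duplicate-label problem and both fixes can be reproduced on a small made-up table. This is a sketch using `pd.concat`; the dataframes `old` and `new` and their contents are hypothetical stand-ins for the marks data.

```python
import pandas as pd

old = pd.DataFrame({'Name': ['asha', 'ben'], 'Marks1': [10, 20], 'Marks2': [30, 40]})
new = pd.DataFrame([['ram', 30, 40], ['sana', 25, 60]],
                   columns=['Name', 'Marks1', 'Marks2'])

kept_index = pd.concat([old, new])                       # labels 0, 1, 0, 1
fresh_index = pd.concat([old, new], ignore_index=True)   # labels 0, 1, 2, 3
also_fresh = kept_index.reset_index(drop=True)           # same result as above

# With duplicate labels, one label selects more than one row
print(kept_index.loc[1])
print(fresh_index)
```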

#### Adding columns <a class="anchor" id="modify-second"></a>

In [None]:
#Add a column explicitly

df['Grade'] = ['Fourth','Fourth','Third',"Third","Third","Second","Second","Second","Third","Second","Second" ]


In [None]:
df.shape

In [None]:
df

In [None]:
# Adding a derived column

df['Total'] = df['Marks1'] + df['Marks2']

In [None]:
df
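
An explicit column must supply exactly one value per row, while a derived column gets its length from the columns it is computed from. A minimal sketch on made-up marks, which also shows the error raised when the lengths do not match:

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['a', 'b'], 'Marks1': [10, 20], 'Marks2': [30, 40]})

# Derived column: the addition is element-wise and aligned on the index
toy['Total'] = toy['Marks1'] + toy['Marks2']

# Explicit column: the list needs one value per row
try:
    toy['Grade'] = ['Third']      # only one value for two rows
    length_error = None
except ValueError as err:
    length_error = str(err)

print(toy)
print(length_error)
```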

#### Rearranging columns

In [None]:
df = df[ ['Name','Grade','Marks1','Marks2','Total'] ]

#### Sorting <a class="anchor" id="modify-third"></a>

In [None]:
# Sorting on a list of columns is easy. sort_values returns a sorted copy and does not modify the dataframe.
# To modify the table, assign the result back to df or use the option inplace=True.

df.sort_values(by=['Total','Marks1'], ascending=[False,True])
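
How the two sort keys interact is easiest to see when the first key has a tie. In this sketch (made-up marks), `Total` is sorted in descending order and the tie at 70 is broken by `Marks1` in ascending order:

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Marks1': [40, 30, 10],
    'Total': [70, 70, 90],
})

ranked = toy.sort_values(by=['Total', 'Marks1'], ascending=[False, True])

print(ranked)   # the sorted copy
print(toy)      # the original keeps its row order
```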
