## DataFrame Preliminaries in Pandas

In this notebook we will learn to load data into a DataFrame and inspect it: the top rows, the shape (i.e., the number of rows and columns), the column names, the index labels, and summary statistics (e.g., mean, standard deviation, median).

In the second and third steps we will learn to create a new DataFrame from a NumPy array and from a list of dictionaries.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

#### 1. Titanic data in a DataFrame

- To load the data and use the `Name` column as the index

In [2]:
titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index('Name')
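
The load-then-`set_index` pattern above can also be written as a single call by passing `index_col` to `pd.read_csv`. The sketch below is only illustrative: it assumes a small in-memory CSV (the `csv_text` sample is hypothetical and stands in for `data/titanic.csv`) and compares the two approaches.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for data/titanic.csv
csv_text = '''PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38
3,1,"Heikkinen, Miss. Laina",26
'''

# Two steps: read everything, then move 'Name' from the columns to the index
two_step = pd.read_csv(io.StringIO(csv_text)).set_index('Name')

# One step: ask read_csv to use 'Name' as the index while parsing
one_step = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(one_step.index.name)
print(list(one_step.columns))
```

Either way, `Name` ends up as the index rather than as a regular column.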

- To see the top 3 rows of the DataFrame

In [4]:
titanic.head(3)

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S


- To find the shape of the DataFrame

In [3]:
titanic.shape

(891, 11)

- To find the list of column names.

In [4]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

- To find the index labels.

In [5]:
titanic.index

Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       ...
       'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
       'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
       'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
       'Graham, Miss. Margaret Edith',
       'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
       'Dooley, Mr. Patrick'],
      dtype='object', name='Name', length=891)
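
The three attributes above (`shape`, `columns`, `index`) are linked: once a column is moved into the index with `set_index`, it no longer counts as a column. A minimal sketch on a hypothetical three-row frame (the `small` and `indexed` names are illustrative only):

```python
import pandas as pd

# Hypothetical miniature of the Titanic table
small = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina', 'Allen, Mr. William Henry'],
    'Age': [22.0, 26.0, 35.0],
    'Fare': [7.25, 7.925, 8.05],
})

print(small.shape)             # rows and columns before indexing by name

indexed = small.set_index('Name')
n_rows, n_cols = indexed.shape  # shape is a plain (rows, columns) tuple

print(n_rows, n_cols)
print(indexed.columns.tolist())  # 'Name' is no longer listed here
print(indexed.index.name)        # it is now the name of the index
```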

- To find preliminary statistics for each numeric column of the DataFrame.

In [6]:
titanic.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
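
Two details of the table above are worth a small illustration: text columns such as `Sex` are left out by default, and `count` ignores missing values (which is why `Age` shows 714 rather than 891). The sketch below uses a hypothetical four-row frame named `toy`; passing `include='all'` is one way to also summarize non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column containing a gap
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, 35.0],
})

stats = toy.describe()               # numeric columns only by default
print(stats)

full = toy.describe(include='all')   # also summarizes the text column
print(full)
```

Here `stats` contains only the `Age` column, and its `count` is 3 because the missing value is skipped.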


#### 2. To create a new DataFrame from a NumPy array.

Let's create a random array of size (100, 10) and random column names. We will use this array and these column names to create the DataFrame in the next step.

In [16]:
import random
A = np.random.rand(100, 10)
letter = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'X']

# Build one random four-letter name per column of A
col_names = [random.choice(letter)
             + random.choice(letter)
             + random.choice(letter)
             + random.choice(letter) for _ in range(A.shape[1])]

In [17]:
print(col_names)

['AEEF', 'CGFE', 'HGDX', 'XABG', 'FCAD', 'XHGA', 'CECG', 'FAEX', 'BCGX', 'DBGH']


In [18]:
df = pd.DataFrame(A, columns=col_names)
df.head()

,AEEF,CGFE,HGDX,XABG,FCAD,XHGA,CECG,FAEX,BCGX,DBGH
0,0.799637,0.278088,0.548942,0.348199,0.358752,0.955875,0.656495,0.530828,0.110569,0.318718
1,0.697795,0.690462,0.592182,0.548039,0.570509,0.313997,0.481012,0.292011,0.473224,0.585045
2,0.940258,0.531903,0.812137,0.3671,0.719538,0.326831,0.942334,0.287535,0.575258,0.045669
3,0.769914,0.097332,0.385476,0.359515,0.906289,0.999929,0.275143,0.816647,0.690136,0.065278
4,0.696829,0.870229,0.372156,0.432654,0.256099,0.402601,0.067141,0.262879,0.337106,0.188892
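
The construction above depends on the number of column names matching the number of array columns. The following sketch shows this under a few assumptions that are not in the original notebook: a seeded NumPy generator (the seed value is arbitrary) so the numbers are reproducible, and plain `col_0 ... col_9` names instead of random letters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only
A = rng.random((100, 10))             # 100 rows, 10 columns, values in [0, 1)

# One name per array column; these names are illustrative
col_names = [f'col_{j}' for j in range(A.shape[1])]
df = pd.DataFrame(A, columns=col_names)
print(df.shape)

# Passing too few names is rejected with a ValueError
try:
    pd.DataFrame(A, columns=col_names[:5])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```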


- To save the new DataFrame to a CSV file

In [19]:
df.to_csv('data/test.csv')
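
By default `to_csv` also writes the index as a first column with an empty header, and reading the file back without further options turns that column into one called `Unnamed: 0`. The sketch below demonstrates this round trip; it writes to an in-memory buffer rather than to `data/test.csv`, and the small `frame` is a hypothetical stand-in for `df`.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for df
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['a', 'b'])

buf = io.StringIO()
frame.to_csv(buf)                 # index is written as an unnamed first column

buf.seek(0)
naive = pd.read_csv(buf)          # the index comes back as a regular column
print(naive.columns.tolist())

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)   # treat the first column as the index
print(restored.columns.tolist())
```

If the index carries no information, passing `index=False` to `to_csv` avoids writing it at all.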

#### 3. To create a new DataFrame from a list of dictionaries.

Here we will create a list of dictionaries, each holding the same keys with different values. Using this list of dictionaries, we will create another DataFrame. The keys of the dictionaries will serve as the column names.

In [21]:
LD = []
for i in range(100):
    # Each record has a random four-letter player name and five random scores
    LD.append({'Player': random.choice(letter)
                         + random.choice(letter)
                         + random.choice(letter)
                         + random.choice(letter),
               'game1': random.uniform(0, 1),
               'game2': random.uniform(0, 1),
               'game3': random.uniform(0, 1),
               'game4': random.uniform(0, 1),
               'game5': random.uniform(0, 1)})

In [22]:
LD[0]

{'Player': 'FHFB',
 'game1': 0.7241234598979669,
 'game2': 0.1891636584406382,
 'game3': 0.31966051832438636,
 'game4': 0.14082264415821089,
 'game5': 0.27483543498639385}

In [24]:
DF = pd.DataFrame(LD)
DF = DF.set_index("Player")

In [26]:
DF.head(10)

Player,game1,game2,game3,game4,game5
FHFB,0.724123,0.189164,0.319661,0.140823,0.274835
AAXC,0.695117,0.943525,0.741207,0.329677,0.357047
XGEB,0.242482,0.831497,0.474151,0.951528,0.232285
FHED,0.414948,0.173729,0.757735,0.855202,0.105417
DCEA,0.778988,0.709245,0.800162,0.86,0.382418
XHCA,0.325321,0.922017,0.217526,0.227107,0.562493
DFAX,0.905453,0.427381,0.016011,0.654016,0.496549
CHCC,0.851235,0.747932,0.746749,0.724306,0.826185
EAHG,0.322508,0.276537,0.46876,0.096865,0.994927
EEXG,0.996789,0.813952,0.566497,0.069764,0.209985
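
The dictionaries above all share the same keys, but that is not required: columns are built from the union of keys, and a key missing from one record shows up as `NaN`. A minimal sketch with two hypothetical records:

```python
import pandas as pd

# Hypothetical records; the second one has no 'game2' entry
records = [
    {'Player': 'AAAA', 'game1': 0.1, 'game2': 0.2},
    {'Player': 'BBBB', 'game1': 0.3},
]

scores = pd.DataFrame(records).set_index('Player')
print(scores)

# Index labels are not forced to be unique, so it can be worth checking
print(scores.index.is_unique)
```

Because the player names in this notebook are drawn at random from a small alphabet, repeated names are possible among 100 records; `index.is_unique` is a quick way to check for them.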
