### Pandas DataFrame

A **DataFrame** is a two-dimensional labeled data structure whose columns can hold different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects.
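To make the "dict of Series" picture concrete, here is a minimal sketch. The names `ages`, `jobs` and `people` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels; each one becomes a column.
ages = pd.Series([19, 26], index=["aman", "Mohit"])
jobs = pd.Series(["student", "student"], index=["aman", "Mohit"])

# The dict keys become the column names, the shared index becomes the row labels.
people = pd.DataFrame({"age": ages, "job": jobs})

print(people.dtypes)         # each column keeps its own dtype
print(type(people["age"]))   # selecting one column gives back a Series
```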

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create three small DataFrames that share the same columns

columns = ['name', 'age', 'gender', 'job']

user1 = pd.DataFrame([['aman', 19, "M", "student"],
                      ['Mohit', 26, "M", "student"]], columns=columns)

user2 = pd.DataFrame([['kanak', 27, "F", "manager"],
                      ['priya', 58, "F", "manager"]], columns=columns)

user3 = pd.DataFrame(dict(name=['pankaj', 'jinat'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))



### Combining DataFrames

**Concatenate DataFrame**

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks any number of frames
users = pd.concat([user1, user2, user3])

print(users)

   age gender        job    name
0   19      M    student    aman
1   26      M    student   Mohit
0   27      F    manager   kanak
1   58      F    manager   priya
0   33      M   engineer  pankaj
1   44      F  scientist   jinat
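Note that the concatenated frame keeps each input's original row labels, so the index reads 0, 1, 0, 1, 0, 1. If unique labels are wanted, one option is `ignore_index=True`. The sketch below uses two small illustrative frames (`first`, `second`) rather than the notebook's own variables.

```python
import pandas as pd

columns = ["name", "age"]
first = pd.DataFrame([["aman", 19], ["Mohit", 26]], columns=columns)
second = pd.DataFrame([["kanak", 27], ["priya", 58]], columns=columns)

# Default: the original row labels are kept, so labels repeat.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With repeated labels, `stacked.loc[0]` returns two rows instead of one, which is easy to overlook.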


**Join DataFrame**

In [14]:
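user5 = pd.DataFrame(dict(name=['aman', 'kanak', 'Mohit', 'priya'],
                          height=[165, 180, 175, 171]))

# Outer join: use the union of keys from both frames.
# Run this cell once only. The result is assigned back to `users`, so
# re-running it merges `height` again and can produce suffixed columns
# such as height_x and height_y.
users = pd.merge(users, user5, on="name", how='outer')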
print(users)

    age gender        job    name  height_x  height_y  height_x  height_y  \
0  19.0      M    student    aman       NaN       NaN     165.0     165.0   
1  26.0      M    student   Mohit       NaN       NaN     175.0     175.0   
2  27.0      F    manager   kanak       NaN       NaN     180.0     180.0   
3  58.0      F    manager   priya       NaN       NaN     171.0     171.0   
4  33.0      M   engineer  pankaj       NaN       NaN       NaN       NaN   
5  44.0      F  scientist   jinat       NaN       NaN       NaN       NaN   
6   NaN    NaN        NaN   alice     165.0     165.0       NaN       NaN   
7   NaN    NaN        NaN    john     180.0     180.0       NaN       NaN   
8   NaN    NaN        NaN    eric     175.0     175.0       NaN       NaN   
9   NaN    NaN        NaN   julie     171.0     171.0       NaN       NaN   

   height_x  height_y  
0     165.0     165.0  
1     175.0     175.0  
2     180.0     180.0  
3     171.0     171.0  
4       NaN       NaN  
5       
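The repeated `height_x` / `height_y` columns in the output above are consistent with the merge cell having been run more than once, although that is an inference from the output rather than something the notebook states. The sketch below, on small illustrative frames (`people`, `heights`), shows how `how` changes the result and how a second merge onto a frame that already has a `height` column produces the suffixes.

```python
import pandas as pd

people = pd.DataFrame({"name": ["aman", "kanak", "pankaj"], "age": [19, 27, 33]})
heights = pd.DataFrame({"name": ["aman", "kanak"], "height": [165, 180]})

# Inner join: only names present in both frames survive.
inner = pd.merge(people, heights, on="name", how="inner")

# Outer join: union of names; indicator=True adds a `_merge` column
# saying whether each row came from the left frame, the right frame, or both.
outer = pd.merge(people, heights, on="name", how="outer", indicator=True)

# Merging `heights` a second time: both sides now have a `height` column,
# so pandas disambiguates them with the default suffixes _x and _y.
again = pd.merge(outer.drop(columns="_merge"), heights, on="name", how="outer")

print(inner)
print(outer)
print(again.columns.tolist())
```

Assigning the merge result to a new name instead of overwriting the left frame is one way to keep such a cell safe to re-run.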

### Summarizing


**Examine the users data**

In [15]:
users # display the frame (long frames are truncated in the middle)
type(users) # DataFrame

users.head() # first 5 rows
users.tail() # last 5 rows
users.index # row labels ("the index")

users.columns # column names (also an Index object)
users.dtypes # data type of each column

users.shape # (number of rows, number of columns)

users.values # underlying NumPy array (users.to_numpy() is the recommended spelling)

users.info() # concise summary (includes memory usage as of pandas 0.15.0)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 10 columns):
age         6 non-null float64
gender      6 non-null object
job         6 non-null object
name        10 non-null object
height_x    4 non-null float64
height_y    4 non-null float64
height_x    4 non-null float64
height_y    4 non-null float64
height_x    4 non-null float64
height_y    4 non-null float64
dtypes: float64(7), object(3)
memory usage: 880.0+ bytes
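As a small illustration of the inspection attributes listed above, the following sketch builds a throwaway frame (`frame` is an illustrative name) and looks at its shape, dtypes and underlying array. It assumes a pandas version that provides `DataFrame.to_numpy()`.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["aman", "kanak"], "age": [19, 27]})

print(frame.shape)    # (number of rows, number of columns)
print(frame.dtypes)   # one dtype per column

# With mixed column types there is no common numeric dtype,
# so the resulting array holds generic Python objects.
array = frame.to_numpy()
print(array.dtype)
```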


**Column selection**

In [16]:
users['gender'] # select one column

type(users['gender']) # Series

users.gender # select one column using attribute access

# select multiple columns
users[['age', 'gender']] # select two columns

my_cols = ['age', 'gender'] # or, create a list...

users[my_cols] # ...and use that list to select columns

type(users[my_cols])

pandas.core.frame.DataFrame
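The difference between single and double brackets is easy to miss, so here is a short sketch on an illustrative frame (`frame` is not one of the notebook's variables). A single label returns a Series, while a list of labels returns a DataFrame even when the list has one element.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["aman", "kanak"],
                      "age": [19, 27],
                      "gender": ["M", "F"]})

one_column = frame["age"]         # single label  -> Series
one_col_frame = frame[["age"]]    # list of one   -> DataFrame with one column
two_columns = frame.loc[:, ["name", "age"]]  # explicit label-based selection

print(type(one_column).__name__)
print(type(one_col_frame).__name__)
print(two_columns.columns.tolist())
```

Attribute access such as `frame.age` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket selection is the safer default.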

**Sorting**

In [17]:
df = users.copy()
df.age.sort_values() # sort a single column; returns a new Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df

print(df)

    age gender        job    name  height_x  height_y  height_x  height_y  \
4  33.0      M   engineer  pankaj       NaN       NaN       NaN       NaN   
2  27.0      F    manager   kanak       NaN       NaN     180.0     180.0   
3  58.0      F    manager   priya       NaN       NaN     171.0     171.0   
5  44.0      F  scientist   jinat       NaN       NaN       NaN       NaN   
0  19.0      M    student    aman       NaN       NaN     165.0     165.0   
1  26.0      M    student   Mohit       NaN       NaN     175.0     175.0   
6   NaN    NaN        NaN   alice     165.0     165.0       NaN       NaN   
7   NaN    NaN        NaN    john     180.0     180.0       NaN       NaN   
8   NaN    NaN        NaN    eric     175.0     175.0       NaN       NaN   
9   NaN    NaN        NaN   julie     171.0     171.0       NaN       NaN   

   height_x  height_y  
4       NaN       NaN  
2     180.0     180.0  
3     171.0     171.0  
5       NaN       NaN  
0     165.0     165.0  
1     17
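In the sorted output above, the rows with missing `job` and `age` end up at the bottom. The sketch below shows that placement on a small illustrative frame, along with the `na_position` parameter that changes it, and confirms that `sort_values` leaves the original frame untouched unless `inplace=True` is passed.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"name": ["aman", "alice", "priya"],
                      "age": [19, np.nan, 58]})

ascending = frame.sort_values(by="age")                    # NaN rows go last
descending = frame.sort_values(by="age", ascending=False)  # NaN rows still go last
nan_first = frame.sort_values(by="age", na_position="first")

print(ascending["name"].tolist())   # ['aman', 'priya', 'alice']
print(descending["name"].tolist())  # ['priya', 'aman', 'alice']
print(nan_first["name"].tolist())   # ['alice', 'aman', 'priya']
print(frame["name"].tolist())       # original order is unchanged
```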

**Summarize all numeric columns**

In [18]:
print(df.describe())

             age    height_x    height_y    height_x    height_y    height_x  \
count   6.000000    4.000000    4.000000    4.000000    4.000000    4.000000   
mean   34.500000  172.750000  172.750000  172.750000  172.750000  172.750000   
std    14.237275    6.344289    6.344289    6.344289    6.344289    6.344289   
min    19.000000  165.000000  165.000000  165.000000  165.000000  165.000000   
25%    26.250000  169.500000  169.500000  169.500000  169.500000  169.500000   
50%    30.000000  173.000000  173.000000  173.000000  173.000000  173.000000   
75%    41.250000  176.250000  176.250000  176.250000  176.250000  176.250000   
max    58.000000  180.000000  180.000000  180.000000  180.000000  180.000000   

         height_y  
count    4.000000  
mean   172.750000  
std      6.344289  
min    165.000000  
25%    169.500000  
50%    173.000000  
75%    176.250000  
max    180.000000  
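By default `describe()` reports only the numeric columns, which is why `gender`, `job` and `name` are absent above. A minimal sketch on an illustrative frame shows how `include="all"` brings the non-numeric columns into the summary as well.

```python
import pandas as pd

frame = pd.DataFrame({"age": [19, 26, 27, 58],
                      "job": ["student", "student", "manager", "manager"]})

numeric_only = frame.describe()             # numeric columns only
everything = frame.describe(include="all")  # adds count/unique/top/freq for text columns

print(numeric_only)
print(everything)
```

Statistics that do not apply to a column (for example `mean` for `job`) appear as NaN in the combined table.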


In [19]:
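# Statistics per group (groupby)
# numeric_only=True skips text columns such as gender and name,
# which raise a TypeError in pandas 2.0 and later if averaged.

print(df.groupby("job").mean(numeric_only=True))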

            age  height_x  height_y  height_x  height_y  height_x  height_y
job                                                                        
engineer   33.0       NaN       NaN       NaN       NaN       NaN       NaN
manager    42.5       NaN       NaN     175.5     175.5     175.5     175.5
scientist  44.0       NaN       NaN       NaN       NaN       NaN       NaN
student    22.5       NaN       NaN     170.0     170.0     170.0     170.0
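When more than one statistic per group is needed, named aggregation is one option. The sketch below rebuilds the age and job data from this notebook in an illustrative frame; the output column names `mean_age`, `oldest` and `members` are arbitrary choices.

```python
import pandas as pd

frame = pd.DataFrame({
    "job": ["student", "student", "manager", "manager", "engineer"],
    "age": [19, 26, 27, 58, 33],
    "gender": ["M", "M", "F", "F", "M"],
})

# Selecting the column first avoids averaging non-numeric columns.
mean_age = frame.groupby("job")["age"].mean()

# Named aggregation: output_name=(input_column, function).
summary = frame.groupby("job").agg(
    mean_age=("age", "mean"),
    oldest=("age", "max"),
    members=("age", "count"),
)

print(mean_age)
print(summary)
```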
