1. What is it?
2. Different ways to create a DataFrame (list, dictionary, files)
3. Different ways to access pandas column, row, and element
4. Operations on rows and columns
5. Preprocessing and Data Cleaning
6. Useful pandas functions
7. Case studies 
8. Iris dataset example
9. Documentation

# What is it?

As per the documentation, a pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. It can be thought of as a dict-like
container for Series objects. Here axis 0 represents the rows and axis 1 represents the columns.<br>
Simply put, a DataFrame can be seen as a collection of named columns. Let's see how to create a DataFrame.
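
To make the two ideas above concrete (a DataFrame as a container of Series, and the meaning of axis 0 and axis 1), here is a minimal sketch. The frame, its column names `a` and `b`, and the row labels are made up for illustration.

```python
import pandas as pd

# A small hypothetical frame: two labeled columns, three labeled rows.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# Each column is a Series that shares the frame's row index.
col_a = df["a"]
print(type(col_a).__name__)   # Series

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row).
col_totals = df.sum(axis=0)
row_totals = df.sum(axis=1)
print(col_totals.to_dict())   # {'a': 6, 'b': 60}
print(row_totals.to_dict())   # {'x': 11, 'y': 22, 'z': 33}
```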

# Different ways to create a dataframe

In [2]:
import pandas as pd

In [4]:
# Create an empty Pandas DataFrame
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


The signature of the DataFrame() constructor looks like this: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False).<br>
Here data can be an ndarray (structured or homogeneous), an Iterable, a dict, or another DataFrame. A dict can contain Series, arrays, constants, or list-like objects.<br>
Now let's pass a list as data. Do you think pandas will interpret the list as a column or as a row?

In [19]:
# Create a Pandas DataFrame
data = [1,2,3,4,5] # a flat list: each element becomes one row
df = pd.DataFrame(data)
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In the previous example we saw that pandas treats the list as a list of rows, so it breaks it into 5 rows with a single column.<br>What if we want to pass it as a single row?

In [20]:
# Create a Pandas DataFrame with single row and list of lists.
data = [[1,2,3,4,5]] 
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5


Now the outer list has only one element (the inner list), hence the DataFrame has only one row. The example below makes this clear.

In [25]:
# Create a Pandas DataFrame with list of lists.
data = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20]]
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,6,7,8,9,10
2,11,12,13,14,15
3,16,17,18,19,20


In [3]:
col1 = [1,2,3,4,5]
col2 = ['A','B','C','D','E']
col3 = ['a','b','c','d','e']
data = list(zip(col1, col2, col3))
print(data)
df = pd.DataFrame(data)
df

[(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c'), (4, 'D', 'd'), (5, 'E', 'e')]


Unnamed: 0,0,1,2
0,1,A,a
1,2,B,b
2,3,C,c
3,4,D,d
4,5,E,e


Now let's pass other arguments such as the row index and column names, as per the signature DataFrame(data=None, index=None, columns=None, dtype=None, copy=False).

In [26]:
# Create a Pandas DataFrame with row index or row names.
rows = ['row1', 'row2','row3','row4','row5']
data = [1,2,3,4,5]
df = pd.DataFrame(data, index=rows)
df

Unnamed: 0,0
row1,1
row2,2
row3,3
row4,4
row5,5


In [27]:
# Create a Pandas DataFrame with column names.
cols = ['col1']
data = ['a','b','c','d','d']
df = pd.DataFrame(data, columns=cols)
df

Unnamed: 0,col1
0,a
1,b
2,c
3,d
4,d


In [28]:
# Create a Pandas DataFrame with rows and column names.
rows = ['row1', 'row2','row3','row4','row5']
cols = ['col1', 'col2']
data = [[1,'a'],[2,'b'],[3,'c'],[4,'d'],[5,'e']] # 5 rows basically.
df = pd.DataFrame(data, columns=cols, index = rows)
df

Unnamed: 0,col1,col2
row1,1,a
row2,2,b
row3,3,c
row4,4,d
row5,5,e


We can pass a datatype for the columns, but only a single dtype can be passed, so make sure all the columns are compatible with the dtype you are passing.

In [47]:
# Create a Pandas DataFrame with datatypes.
cols = ['col1', 'col2', 'col3']
data = [[1,11, 111],[2,22, 222],[3,33, 333]] # 3 rows.
df = pd.DataFrame(data, columns=cols, dtype=float)
df

Unnamed: 0,col1,col2,col3
0,1.0,11.0,111.0
1,2.0,22.0,222.0
2,3.0,33.0,333.0


To change the datatype of individual columns, pass astype a dict that maps column names to dtypes.

In [48]:
# astype with a dict converts only the listed columns; col1 stays float.
dtype_dict = {'col2': int, 'col3': int}
df = df.astype(dtype_dict)
df

Unnamed: 0,col1,col2,col3
0,1.0,11,111
1,2.0,22,222
2,3.0,33,333
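
The behaviour above can be checked directly on the dtypes. This is a small sketch on a made-up two-column frame; the exact integer width reported for `int` can depend on the platform, so only the kind of dtype is inspected.

```python
import pandas as pd

df = pd.DataFrame([[1, 11], [2, 22]], columns=["col1", "col2"], dtype=float)

# Only col2 is named in the mapping, so only col2 changes.
converted = df.astype({"col2": int})

col1_dtype = str(converted["col1"].dtype)
col2_kind = converted["col2"].dtype.kind   # 'i' means signed integer
print(col1_dtype, col2_kind)               # float64 i
```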


Missing values: when a row has fewer values than there are columns, pandas fills the gap with None/NaN.

In [140]:
#None example
cols = ['col1', 'col2']
data = [[1,'a'],[2],[3,'c'],[4,'d'],[5]] # 5 rows basically.
df = pd.DataFrame(data, columns=cols)
df

Unnamed: 0,col1,col2
0,1,a
1,2,
2,3,c
3,4,d
4,5,
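
The empty cells in the output above are missing values. A short sketch, using a shortened version of the same data, shows how they can be detected with `isna()`.

```python
import pandas as pd

cols = ["col1", "col2"]
data = [[1, "a"], [2], [3, "c"]]  # the second row is one value short
df = pd.DataFrame(data, columns=cols)

# isna() flags the padded cell; None and NaN are both treated as missing.
missing_mask = df["col2"].isna()
print(missing_mask.tolist())  # [False, True, False]
```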


# Creating a pandas DataFrame from dictionary objects

In [15]:
data = [{'name':'Gitesh', 'Gender':'Male','Address':'Hyderabad'}]
df = pd.DataFrame(data)
df

Unnamed: 0,name,Gender,Address
0,Gitesh,Male,Hyderabad


In [49]:
data = [{'name':'Gitesh', 'Gender':'Male','Address':'Hyderabad'}, {'name':'Suresh', 'Gender':'Male','Address':'Hyderabad'}]
df = pd.DataFrame(data)
df

Unnamed: 0,name,Gender,Address
0,Gitesh,Male,Hyderabad
1,Suresh,Male,Hyderabad


In [50]:
data = {'name':['Gitesh','Suresh'], 'Gender':['Male','Male'],'Address':['Hyderabad', 'Hyd']}
df = pd.DataFrame(data)
df

Unnamed: 0,name,Gender,Address
0,Gitesh,Male,Hyderabad
1,Suresh,Male,Hyd


In [51]:
data = {'name':'Gitesh', 'Gender':'Male','Address':'Hyderabad'}
df = pd.DataFrame(data, index=[1,2,3])
df

Unnamed: 0,name,Gender,Address
1,Gitesh,Male,Hyderabad
2,Gitesh,Male,Hyderabad
3,Gitesh,Male,Hyderabad
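
The dictionary examples above use two orientations: a list of dicts (one dict per row) and a dict of lists (one list per column). The sketch below, with shortened data, suggests that both produce the same frame, and shows why the last example needed an explicit index.

```python
import pandas as pd

# Row-oriented: a list with one dict per row.
records = [
    {"name": "Gitesh", "Gender": "Male"},
    {"name": "Suresh", "Gender": "Male"},
]
# Column-oriented: one dict whose values are whole columns.
columns = {"name": ["Gitesh", "Suresh"], "Gender": ["Male", "Male"]}

df_rows = pd.DataFrame(records)
df_cols = pd.DataFrame(columns)
same = df_rows.equals(df_cols)
print(same)  # True

# A dict of plain scalars has no length, so pandas needs an explicit index.
try:
    pd.DataFrame({"name": "Gitesh"})
    scalar_error = None
except ValueError as exc:
    scalar_error = str(exc)
print(scalar_error)
```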


# Creating DataFrame from files like csv

In [4]:
# Reading without telling pandas that the file has no header: the first data row is used as the column names.
filename = "Data/prospects3.csv"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,james,butler,jbutler@gmail.com,y,78528.5
0,Josephine,Darakjy,josephine_darakjy@darakjy.org,Y,56795.0
1,ART,VENERE,ART@VENERE.ORG,Yes,81600.0
2,Lenna,Paprocki,lpaprocki@hotmail.com,No,91506.0
3,donette,foller,donette.foller@cox.net,Y,61047.0
4,SIMONA,MORASCA,SIMONA@MORASCA.COM,y,58457.0


In [15]:
# Passing the column names ourselves; header=None says the file has no header row.
filename = "Data/prospects3.csv"
colnames = ['FIRST_NAME', 'LAST_NAME', 'EMAIL_ADDRESS', 'OWNS_CAR', 'ANNUAL_SALARY']
df = pd.read_csv(filename ,names=colnames, header=None)
df.head()

Unnamed: 0,FIRST_NAME,LAST_NAME,EMAIL_ADDRESS,OWNS_CAR,ANNUAL_SALARY
0,james,butler,jbutler@gmail.com,y,78528.5
1,Josephine,Darakjy,josephine_darakjy@darakjy.org,Y,56795.0
2,ART,VENERE,ART@VENERE.ORG,Yes,81600.0
3,Lenna,Paprocki,lpaprocki@hotmail.com,No,91506.0
4,donette,foller,donette.foller@cox.net,Y,61047.0
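
The difference between the two reads above can be reproduced without a file. This sketch assumes a tiny in-memory CSV (not the real prospects3.csv) and uses io.StringIO in place of a filename.

```python
import io
import pandas as pd

# Hypothetical two-line CSV with no header row, held in memory.
raw = "james,butler,78528.5\nJosephine,Darakjy,56795.0\n"
names = ["FIRST_NAME", "LAST_NAME", "ANNUAL_SALARY"]

# Default: the first line is consumed as the header, so one data row is lost.
df_default = pd.read_csv(io.StringIO(raw))
# header=None keeps every line as data; names= supplies the labels.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(df_default.shape)  # (1, 3)
print(df_named.shape)    # (2, 3)
```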


In [3]:
filename = "Data/mpg.csv"
df = pd.read_csv(filename) # This file already has headers.
print(df.head())

FileNotFoundError: [Errno 2] File b'Data/mpg.csv' does not exist: b'Data/mpg.csv'

# Different ways to access pandas column, row, and element.

In [201]:
data = [[1,2,3],[10,20,30],[100,200,300], [111, 222, 333]]
cols = ['col1', 'col2', 'col3']
df = pd.DataFrame(data, columns = cols)
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,10,20,30
2,100,200,300
3,111,222,333


Access a pandas DataFrame column using its name.

In [203]:
df['col1']

0      1
1     10
2    100
3    111
Name: col1, dtype: int64

Accessing the values of a column as a NumPy array.

In [74]:
df['col1'].values

array([  1,  10, 100, 111], dtype=int64)

Get the column values as a list.

In [75]:
df['col1'].tolist()

[1, 10, 100, 111]

Access multiple columns by passing a list of names.

In [76]:
df[['col1','col2']]

Unnamed: 0,col1,col2
0,1,2
1,10,20
2,100,200
3,111,222


In [77]:
df[['col1','col2']].values.tolist()

[[1, 2], [10, 20], [100, 200], [111, 222]]

Access every column by name and store each one as a list inside an outer list.

In [78]:
ll = []
for col in df.columns:
    l = df[col].values.tolist()
    ll.append(l)
ll

[[1, 10, 100, 111], [2, 20, 200, 222], [3, 30, 300, 333]]

In [80]:
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,10,20,30
2,100,200,300
3,111,222,333


# Accessing pandas DataFrame rows and slicing.

The syntax to slice rows is df[i:j:k], where i is the start position, j is the stop position (excluded), and k is the step. For example, df[0:6:2] returns the row at position 0 and then skips k-1 = (2-1) = 1 row after each returned row, so the next rows it returns are at positions 2 and 4. It does not include the row at position 6 because the stop position is exclusive.

If we don't specify k, it defaults to 1.
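
The df[0:6:2] case described above needs more rows than the 4-row frame has, so here is a sketch on a made-up 7-row frame. It follows the same rules as ordinary Python list slicing.

```python
import pandas as pd

# Hypothetical frame with seven rows so that df[0:6:2] has something to skip.
df = pd.DataFrame({"value": [0, 10, 20, 30, 40, 50, 60]})

picked = df[0:6:2]
print(picked.index.tolist())  # [0, 2, 4]

# The same positions come out of an ordinary Python list slice.
list_positions = list(range(7))[0:6:2]
print(list_positions)         # [0, 2, 4]
```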

In [86]:
df[1:4]

Unnamed: 0,col1,col2,col3
1,10,20,30
2,100,200,300
3,111,222,333


In [88]:
df[1:4:2]

Unnamed: 0,col1,col2,col3
1,10,20,30
3,111,222,333


In [89]:
df[1:4:2].values.tolist()

[[10, 20, 30], [111, 222, 333]]
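
Slicing returns whole rows. For a single row or a single element, label-based `.loc` and position-based `.iloc` (plus `.iat` for scalars) are the usual tools. A minimal sketch, assuming made-up row labels r0 to r2 so that labels and positions are easy to tell apart:

```python
import pandas as pd

data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
df = pd.DataFrame(data, columns=["col1", "col2", "col3"], index=["r0", "r1", "r2"])

row_by_label = df.loc["r1"]            # one row as a Series, selected by label
row_by_pos = df.iloc[1]                # the same row, selected by position
cell_by_label = df.loc["r2", "col3"]   # single element by row and column label
cell_by_pos = df.iat[2, 2]             # single element by integer position

print(row_by_label.tolist())       # [10, 20, 30]
print(cell_by_label, cell_by_pos)  # 300 300
```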

# Operations on rows and columns

Create (append) a new column by assigning a list to a new column name. Make sure the length of the list matches the number of rows.

In [92]:
df['newColumn'] = [0,0,0,0]
df

Unnamed: 0,col1,col2,col3,newColumn
0,1,2,3,0
1,10,20,30,0
2,100,200,300,0
3,111,222,333,0


Create a new column by adding two existing columns.

In [99]:
df['AddedColum'] = df['col1'] + df['col2']
df

Unnamed: 0,col1,col2,col3,AddedColum
0,1,2,3,3
1,10,20,30,30
2,100,200,300,300
3,111,222,333,333
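
The length rule for new columns can be seen in a small sketch: a scalar is broadcast to every row, while a list of the wrong length is rejected. The column names `flag` and `too_short` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 10, 100, 111], "col2": [2, 20, 200, 222]})

# A scalar is broadcast to every row.
df["flag"] = 0

# A list must have exactly one value per row; otherwise pandas raises ValueError.
try:
    df["too_short"] = [0, 0]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df["flag"].tolist())        # [0, 0, 0, 0]
print("too_short" in df.columns)  # False
```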


In [135]:
data = [[1,2,3],[10,20,30],[100,200,300], [111, 222, 333]]
cols = ['col1', 'col2', 'col3']
df = pd.DataFrame(data, columns = cols)
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,10,20,30
2,100,200,300
3,111,222,333


Overwrite an existing row. In the first example the single value is broadcast to every column of the row.

In [106]:
df.iloc[0] = [7]
df

Unnamed: 0,col1,col2,col3
0,7,7,7
1,10,20,30
2,100,200,300
3,111,222,333


In [112]:
df.iloc[0] = [7,8,9]
df

Unnamed: 0,col1,col2,col3
0,7,8,9
1,10,20,30
2,100,200,300
3,111,222,333


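Appending a new row. Older pandas versions offered df.append, which accepted a Series, dict, or DataFrame; it was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is used here. Note that the new row keeps its own index label (0).

In [128]:
new_row = pd.DataFrame({'col1':[5],'col2':[6],'col3':[7]})
pd.concat([df, new_row])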

Unnamed: 0,col1,col2,col3
0,7,8,9
1,10,20,30
2,100,200,300
3,111,222,333
0,5,6,7


Appending a new row with an explicit index label.

In [129]:
new_row = pd.DataFrame({'col1':[5],'col2':[6],'col3':[7]}, index = [4])
pd.concat([df, new_row]) # pd.concat replaces df.append, which was removed in pandas 2.0

Unnamed: 0,col1,col2,col3
0,7,8,9
1,10,20,30
2,100,200,300
3,111,222,333
4,5,6,7


Concatenation of multiple pandas DataFrames.

In [131]:
row1 = pd.DataFrame({'col1':[5],'col2':[6],'col3':[7]}, index = [4])
row2 = pd.DataFrame({'col1':[8],'col2':[9],'col3':[10]}, index = [5])
pd.concat([df, row1, row2])

Unnamed: 0,col1,col2,col3
0,7,8,9
1,10,20,30
2,100,200,300
3,111,222,333
4,5,6,7
5,8,9,10
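
When the incoming rows carry their own index labels, the result can end up with duplicate labels. One way around this, sketched here on a made-up two-row frame, is `ignore_index=True`, which renumbers the result from 0.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 10], "col2": [2, 20]})
new_row = pd.DataFrame({"col1": [5], "col2": [6]})  # default index is [0]

kept = pd.concat([df, new_row])                       # labels 0, 1, 0
renumbered = pd.concat([df, new_row], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```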


Deleting a column. drop returns a new DataFrame and leaves df unchanged unless we pass inplace=True.

In [137]:
df.drop(['col1'],axis=1)

Unnamed: 0,col2,col3
0,2,3
1,20,30
2,200,300
3,222,333
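
A short sketch of the difference between the default behaviour of `drop` and `inplace=True`, on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 10], "col2": [2, 20], "col3": [3, 30]})

# drop returns a new frame; df itself still has all three columns.
trimmed = df.drop(["col1"], axis=1)
before = df.columns.tolist()
print(before)                    # ['col1', 'col2', 'col3']
print(trimmed.columns.tolist())  # ['col2', 'col3']

# With inplace=True the frame is modified and the call returns None.
result = df.drop(["col1"], axis=1, inplace=True)
print(result)                    # None
print(df.columns.tolist())       # ['col2', 'col3']
```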


Pandas DataFrame deleting multiple columns

In [139]:
df.drop(['col1','col2'],axis=1)

Unnamed: 0,col3
0,3
1,30
2,300
3,333


In [140]:
cols=['col1','col2']
df.drop(columns = cols) # columns= already implies axis=1

Unnamed: 0,col3
0,3
1,30
2,300
3,333


In [143]:
df.drop(df.columns[[0,1]],axis=1)

Unnamed: 0,col3
0,3
1,30
2,300
3,333


In [145]:
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,10,20,30
2,100,200,300
3,111,222,333


Pandas DataFrame deleting rows.

In [155]:
df.drop([0,1,3]) # deleting 0th, 1st and 3rd row by index.

Unnamed: 0,col1,col2,col3
2,100,200,300


# Preprocessing and Data Cleaning.

In [158]:
filename = 'Data/undergradSurvey.csv'
df = pd.read_csv(filename)
df.head()

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
0,1.0,Female,20.0,Junior,Other,Yes,2.9,Full-Time,50.0,3.0
1,2.0,Male,23.0,Senior,Management,Yes,3.6,Part-Time,25.0,4.0
2,3.0,Male,21.0,Junior,Other,Yes,2.5,Part-Time,45.0,4.0
3,4.0,Male,21.0,Junior,MIS,Yes,2.5,Full-Time,40.0,6.0
4,5.0,Male,23.0,Senior,Other,Undecided,2.8,Unemployed,40.0,4.0


Looking at the last rows, we can observe NaN values in several columns.

In [159]:
df.tail()

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
59,60.0,Female,20.0,Sophomore,MIS,No,2.5,Part-Time,55.0,4.0
60,61.0,Female,23.0,Senior,Accounting,Yes,3.5,Part-Time,30.0,3.0
61,62.0,Female,23.0,Senior,Economics/Finance,No,3.2,Part-Time,70.0,3.0
62,,,,,,,,,,
63,,,26.0,,,,,,,


Check which columns contain NaN values.

In [160]:
df.isnull().any() # True means the column contains NaN values. Here every column has at least one NaN.

st_id             True
gender            True
age               True
class_st          True
major             True
grad intention    True
gpa               True
employment        True
salary            True
satisfaction      True
dtype: bool

Count the NaN values in each column.

In [161]:
df.isnull().sum()

st_id             2
gender            2
age               1
class_st          2
major             2
grad intention    2
gpa               2
employment        2
salary            2
satisfaction      2
dtype: int64

Count the non-NaN values in each column.

In [163]:
df.count()

st_id             62
gender            62
age               63
class_st          62
major             62
grad intention    62
gpa               62
employment        62
salary            62
satisfaction      62
dtype: int64
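
The two counts are complementary: for every column, missing plus present equals the number of rows. A sketch on a made-up frame with a few gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
df = pd.DataFrame({"age": [20.0, np.nan, 23.0], "gpa": [2.9, np.nan, np.nan]})

n_missing = df.isnull().sum()
n_present = df.count()

print(n_missing.to_dict())  # {'age': 1, 'gpa': 2}
print(n_present.to_dict())  # {'age': 2, 'gpa': 1}

# Per column, missing + present adds up to the number of rows.
adds_up = bool(((n_missing + n_present) == len(df)).all())
print(adds_up)              # True
```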

Now that we know where the NaN values are in our data, let's remove them.

Remove the rows that have a NaN value in at least one column.

In [164]:
df.dropna(axis=0).tail() # This removes all the rows having a NaN value in any of the columns.

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
57,58.0,Female,21.0,Senior,International Business,No,2.4,Part-Time,40.0,3.0
58,59.0,Female,20.0,Junior,MIS,No,2.9,Part-Time,40.0,4.0
59,60.0,Female,20.0,Sophomore,MIS,No,2.5,Part-Time,55.0,4.0
60,61.0,Female,23.0,Senior,Accounting,Yes,3.5,Part-Time,30.0,3.0
61,62.0,Female,23.0,Senior,Economics/Finance,No,3.2,Part-Time,70.0,3.0


Remove the rows that have NaN values in selected columns.

In [167]:
df.dropna(axis=0, subset=['age']).tail() # Removes a row only when NaN exists in the age column; we can specify multiple columns.

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
58,59.0,Female,20.0,Junior,MIS,No,2.9,Part-Time,40.0,4.0
59,60.0,Female,20.0,Sophomore,MIS,No,2.5,Part-Time,55.0,4.0
60,61.0,Female,23.0,Senior,Accounting,Yes,3.5,Part-Time,30.0,3.0
61,62.0,Female,23.0,Senior,Economics/Finance,No,3.2,Part-Time,70.0,3.0
63,,,26.0,,,,,,,


In [168]:
df.dropna(axis=0, subset=['age', 'gender']).tail() # Removes a row if any of the given columns (age, gender) contains NaN.

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
57,58.0,Female,21.0,Senior,International Business,No,2.4,Part-Time,40.0,3.0
58,59.0,Female,20.0,Junior,MIS,No,2.9,Part-Time,40.0,4.0
59,60.0,Female,20.0,Sophomore,MIS,No,2.5,Part-Time,55.0,4.0
60,61.0,Female,23.0,Senior,Accounting,Yes,3.5,Part-Time,30.0,3.0
61,62.0,Female,23.0,Senior,Economics/Finance,No,3.2,Part-Time,70.0,3.0


Remove rows based on the number of non-NaN values in the row.

In [170]:
df.dropna(axis=0,thresh = 1).tail() # thresh=1 keeps rows with at least 1 non-NaN value, so only all-NaN rows are deleted.

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
58,59.0,Female,20.0,Junior,MIS,No,2.9,Part-Time,40.0,4.0
59,60.0,Female,20.0,Sophomore,MIS,No,2.5,Part-Time,55.0,4.0
60,61.0,Female,23.0,Senior,Accounting,Yes,3.5,Part-Time,30.0,3.0
61,62.0,Female,23.0,Senior,Economics/Finance,No,3.2,Part-Time,70.0,3.0
63,,,26.0,,,,,,,
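
To see how `thresh` behaves for different values, here is a sketch on a made-up frame whose rows contain 3, 1 and 0 non-NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan], "c": [3.0, np.nan, np.nan]}
)
# Non-NaN counts per row are 3, 1 and 0.

kept_thresh_1 = df.dropna(thresh=1).index.tolist()
kept_thresh_2 = df.dropna(thresh=2).index.tolist()
kept_how_all = df.dropna(how="all").index.tolist()

print(kept_thresh_1)  # [0, 1]  only the all-NaN row goes
print(kept_thresh_2)  # [0]     rows need at least 2 values to survive
print(kept_how_all)   # [0, 1]  same effect as thresh=1
```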


Remove columns that contain NaN values. By default, a single NaN inside a column is enough for that column to be deleted.

In [172]:
df.dropna(axis=1).head() # Every column contains at least one NaN, so all the columns are deleted.

0
1
2
3
4


Remove columns only when all of their values are NaN by passing how='all'. We can use thresh here too.

In [176]:
df.dropna(axis=1, how = 'all').head()

Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
0,1.0,Female,20.0,Junior,Other,Yes,2.9,Full-Time,50.0,3.0
1,2.0,Male,23.0,Senior,Management,Yes,3.6,Part-Time,25.0,4.0
2,3.0,Male,21.0,Junior,Other,Yes,2.5,Part-Time,45.0,4.0
3,4.0,Male,21.0,Junior,MIS,Yes,2.5,Full-Time,40.0,6.0
4,5.0,Male,23.0,Senior,Other,Undecided,2.8,Unemployed,40.0,4.0
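
The column-wise options can be compared side by side. This sketch assumes a made-up frame with one complete column, one column with a single gap, and one completely empty column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "full": [1, 2, 3],
        "one_gap": [1.0, np.nan, 3.0],
        "empty": [np.nan, np.nan, np.nan],
    }
)

any_dropped = df.dropna(axis=1)                # default how="any"
all_dropped = df.dropna(axis=1, how="all")     # drop only fully empty columns
thresh_dropped = df.dropna(axis=1, thresh=2)   # keep columns with >= 2 values

print(any_dropped.columns.tolist())     # ['full']
print(all_dropped.columns.tolist())     # ['full', 'one_gap']
print(thresh_dropped.columns.tolist())  # ['full', 'one_gap']
```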


# Dealing with categorical values

In [174]:
x1 = df.dropna(axis=0).copy() # copy the NaN-free rows into x1 so the assignment below works on an independent frame
d = {'Male':0, 'Female':1}
x1['gender'] = x1['gender'].map(d) # This maps Male to 0 and Female to 1
x1.head()


Unnamed: 0,st_id,gender,age,class_st,major,grad intention,gpa,employment,salary,satisfaction
0,1.0,1,20.0,Junior,Other,Yes,2.9,Full-Time,50.0,3.0
1,2.0,0,23.0,Senior,Management,Yes,3.6,Part-Time,25.0,4.0
2,3.0,0,21.0,Junior,Other,Yes,2.5,Part-Time,45.0,4.0
3,4.0,0,21.0,Junior,MIS,Yes,2.5,Full-Time,40.0,6.0
4,5.0,0,23.0,Senior,Other,Undecided,2.8,Unemployed,40.0,4.0
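
If the NaN rows have not been removed first, a plain dict lookup such as d[x] would fail on the missing values. A hedged sketch of one alternative: `Series.map` with a dict leaves unmatched values as NaN. The frame and the column name `gender_code` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", np.nan, "Male"]})
codes = {"Male": 0, "Female": 1}

# Series.map looks each value up in the dict; anything not found becomes NaN,
# so missing data does not raise a KeyError.
df["gender_code"] = df["gender"].map(codes)
print(df["gender_code"].tolist())  # [1.0, 0.0, nan, 0.0]
```

Because of the NaN, the new column is float rather than int.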


# Fill the missing values with the average

In [179]:
df['age'].tail()

59    20.0
60    23.0
61    23.0
62     NaN
63    26.0
Name: age, dtype: float64

In [178]:
df['age'].fillna(df['age'].mean()).tail()

59    20.000000
60    23.000000
61    23.000000
62    21.206349
63    26.000000
Name: age, dtype: float64
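
Note that fillna returns a new Series, so the call above does not change df. A small sketch on made-up ages shows the difference between discarding the result and assigning it back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, 23.0, np.nan, 26.0]})

mean_age = df["age"].mean()          # NaN is skipped: (20 + 23 + 26) / 3 = 23.0
df["age"].fillna(mean_age)           # result discarded, df is unchanged
still_missing = int(df["age"].isna().sum())
print(still_missing)                 # 1

df["age"] = df["age"].fillna(mean_age)   # assign back to keep the fill
print(df["age"].tolist())            # [20.0, 23.0, 23.0, 26.0]
```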

# Useful pandas functions

In [192]:
print("each columns datatypes: \n",df.dtypes)
print("age column's datatype: \n",df.age.dtypes)
print("shape of the DataFrame(row*col) is : ",df.shape)
print("Distribution of different age in the given data:\n",df['age'].value_counts())

each columns datatypes: 
 st_id             float64
gender             object
age               float64
class_st           object
major              object
grad intention     object
gpa               float64
employment         object
salary            float64
satisfaction      float64
dtype: object
age column's datatype: 
 float64
shape of the DataFrame(row*col) is :  (64, 10)
Distribution of different age in the given data:
 21.0    22
20.0    14
22.0    11
19.0     5
23.0     5
24.0     3
26.0     2
18.0     1
Name: age, dtype: int64


In [182]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 10 columns):
st_id             62 non-null float64
gender            62 non-null object
age               63 non-null float64
class_st          62 non-null object
major             62 non-null object
grad intention    62 non-null object
gpa               62 non-null float64
employment        62 non-null object
salary            62 non-null float64
satisfaction      62 non-null float64
dtypes: float64(5), object(5)
memory usage: 5.1+ KB


In [183]:
df.describe()

Unnamed: 0,st_id,age,gpa,salary,satisfaction
count,62.0,63.0,62.0,62.0,62.0
mean,31.5,21.206349,3.129032,48.548387,3.741935
std,18.041619,1.546679,0.377388,12.080912,1.213793
min,1.0,18.0,2.3,25.0,1.0
25%,16.25,20.0,2.9,40.0,3.0
50%,31.5,21.0,3.15,50.0,4.0
75%,46.75,22.0,3.4,55.0,4.0
max,62.0,26.0,3.9,80.0,6.0


# Case studies

In [200]:
#Wine data
# cols = ['fixed acidity', 'volatile acidity','citric acid','residual sugar', 'chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality']
df = pd.read_csv("Data/winequality-white.csv", sep = ';') # read_csv already returns a DataFrame
print(df.shape)
print(df.columns)

df.head()

(4898, 12)
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [466]:
# Data cleaning
filename = "Data/prospects3.csv"
colnames = ['FIRST_NAME', 'LAST_NAME', 'EMAIL_ADDRESS', 'OWNS_CAR', 'ANNUAL_SALARY']
df = pd.read_csv(filename ,names=colnames, header=None)
df.head()


Unnamed: 0,FIRST_NAME,LAST_NAME,EMAIL_ADDRESS,OWNS_CAR,ANNUAL_SALARY
0,james,butler,jbutler@gmail.com,y,78528.5
1,Josephine,Darakjy,josephine_darakjy@darakjy.org,Y,56795.0
2,ART,VENERE,ART@VENERE.ORG,Yes,81600.0
3,Lenna,Paprocki,lpaprocki@hotmail.com,No,91506.0
4,donette,foller,donette.foller@cox.net,Y,61047.0


In [467]:
# Clean it: normalize OWNS_CAR to 'y' or 'n', ignoring case and surrounding whitespace.
#check clean.py
# Lambda signature: lambda arg1, arg2, ..., argn: expression
def myfunc(OWNS_CAR):
    if str(OWNS_CAR).strip().lower() in ('y', 'yes'):
        return 'y'
    else:
        return 'n'

In [469]:
# Equivalent row-wise form: df.apply(lambda x: myfunc(x['OWNS_CAR']), axis=1)
df['OWNS_CAR'] = df['OWNS_CAR'].apply(myfunc)
df.head()

Unnamed: 0,FIRST_NAME,LAST_NAME,EMAIL_ADDRESS,OWNS_CAR,ANNUAL_SALARY
0,james,butler,jbutler@gmail.com,y,78528.5
1,Josephine,Darakjy,josephine_darakjy@darakjy.org,y,56795.0
2,ART,VENERE,ART@VENERE.ORG,y,81600.0
3,Lenna,Paprocki,lpaprocki@hotmail.com,n,91506.0
4,donette,foller,donette.foller@cox.net,y,61047.0
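
The same kind of cleanup can also be written without a Python-level function, using the vectorized string methods. This is only a sketch of one possible variant, on made-up values, that treats any casing of "y" or "yes" as yes.

```python
import pandas as pd

df = pd.DataFrame({"OWNS_CAR": ["y", "Y", "Yes", "No", " yes "]})

# Lower-case and strip once, then test membership for the whole column.
normalized = df["OWNS_CAR"].str.strip().str.lower()
df["OWNS_CAR"] = normalized.isin(["y", "yes"]).map({True: "y", False: "n"})
print(df["OWNS_CAR"].tolist())  # ['y', 'y', 'y', 'n', 'y']
```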


# Iris dataset example

In [184]:
iris = pd.read_csv('Data/iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [185]:
# Get important information
print("shape: \n", iris.shape)
print("columns: \n", iris.columns)
print("species counts : \n", iris['species'].value_counts())

shape: 
 (150, 5)
columns: 
 Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
species counts : 
 setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64


In [193]:
cols = ['sepal_length','sepal_width','petal_length','petal_width']
x = iris[cols]
x.head() 

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [194]:
y = iris['species']
y.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

In [196]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score

In [197]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2)
y_test.head()

1          setosa
75     versicolor
67     versicolor
27         setosa
122     virginica
Name: species, dtype: object

In [198]:
y_test.value_counts()

versicolor    13
virginica      9
setosa         8
Name: species, dtype: int64

In [199]:
clf = tree.DecisionTreeClassifier()
clf.fit(x_train,y_train)

result = clf.predict(x_test)
score = accuracy_score(y_test, result) # signature is accuracy_score(y_true, y_pred)
print(score)

0.9666666666666667
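
The split above is random, so the class counts in the test set (13, 9, 8) and the score change from run to run. A hedged sketch of a repeatable, class-balanced variant follows. It assumes scikit-learn's bundled copy of Iris (`load_iris(as_frame=True)`) as a stand-in for Data/iris.csv, and the `random_state=0` seed is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Data/iris.csv.
iris = load_iris(as_frame=True)
x, y = iris.data, iris.target

# stratify=y keeps the class proportions; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)
test_counts = sorted(y_test.value_counts().tolist())
print(test_counts)  # [10, 10, 10]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
score = accuracy_score(y_test, clf.predict(x_test))
print(score)
```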


# Miscellaneous

In [365]:
df.T

Unnamed: 0,0,1,2
0,1,A,a
1,2,B,b
2,3,C,c
3,4,D,d
4,5,E,e


In [482]:
df['gender'].value_counts()

Female    33
Male      29
Name: gender, dtype: int64

In [13]:
df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)], 'C':[2,4,7,11]})
df

Unnamed: 0,A,B,C
0,0,0,2
1,1,2,4
2,2,4,7
3,3,6,11


In [9]:
df['A'].corr(df['B'])

1.0

In [14]:
df.corr()

Unnamed: 0,A,B,C
A,1.0,1.0,0.989071
B,1.0,1.0,0.989071
C,0.989071,0.989071,1.0
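
By default corr() computes the Pearson (linear) correlation, which is why A and B (B is exactly 2 * A) score 1.0 while C scores slightly less. As a sketch, the rank-based Spearman method can be requested with `method="spearman"`; it only looks at ordering, so A and C correlate perfectly under it.

```python
import pandas as pd

df = pd.DataFrame({"A": range(4), "B": [2 * i for i in range(4)], "C": [2, 4, 7, 11]})

pearson = df.corr()                     # default method is "pearson"
spearman = df.corr(method="spearman")   # rank-based alternative

pearson_ab = pearson.loc["A", "B"]      # exactly linear, so 1 up to rounding
pearson_ac = pearson.loc["A", "C"]      # increasing but not linear, below 1
spearman_ac = spearman.loc["A", "C"]    # same ordering, so 1 up to rounding
print(round(pearson_ab, 6), round(pearson_ac, 6), round(spearman_ac, 6))
```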


In [549]:
df.corr()

Unnamed: 0,A,B
A,1.0,1.0
B,1.0,1.0


In [2]:
data = [[1,2,'acm'],[10,20,'bcc'],[100,200,'acm'], [111, 222, 'bcc']]
cols = ['col1', 'col2', 'col3']
df = pd.DataFrame(data, columns = cols)
df

Unnamed: 0,col1,col2,col3
0,1,2,acm
1,10,20,bcc
2,100,200,acm
3,111,222,bcc


In [4]:
mask = df.col3.str.contains("ac")
df_Accl = df[mask]
df_NonAccl = df[~mask]

In [5]:
df_Accl

Unnamed: 0,col1,col2,col3
0,1,2,acm
2,100,200,acm


In [6]:
df_NonAccl

Unnamed: 0,col1,col2,col3
1,10,20,bcc
3,111,222,bcc
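
The mask above works because col3 has no missing values. If the column may contain None/NaN or mixed casing, `str.contains` accepts `na=` and `case=` arguments. A sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"col3": ["acm", "bcc", None, "ACM"]})

# na=False turns missing entries into False so the result is a clean boolean mask;
# case=False makes the match case-insensitive.
mask = df["col3"].str.contains("ac", case=False, na=False)
matched = df[mask].index.tolist()
unmatched = df[~mask].index.tolist()

print(mask.tolist())  # [True, False, False, True]
print(matched)        # [0, 3]
print(unmatched)      # [1, 2]
```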
