# Sample Code
Objective: Build a pipeline that selects eligible subjects whose attributes meet the requisite criteria.


**Step 1**

The required modules are imported:
1. pandas
2. numpy
3. matplotlib.pyplot

**Step 2**

The relevant data files are downloaded and saved as CSV files. Our work starts by reading these CSV files into the workspace. The original files have been modified so that they contain only representative columns and data.
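
As a minimal sketch of the reading step, the path can be built from its parts with `pathlib` rather than hard-coding separators. The `data_dir` folder and the tiny file written below are hypothetical stand-ins, since the real file contents are not shown here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory; a temporary folder stands in for the real one.
data_dir = Path(tempfile.mkdtemp())

# Write a tiny stand-in for file_1.csv so the example is self-contained.
(data_dir / 'file_1.csv').write_text(
    'This is Attribute_1,This is Attribute_2\n'
    '4,Virender\n'
    '5,Virat\n'
)

# Build the path from its parts, then read the file into a dataframe.
df1_sample = pd.read_csv(data_dir / 'file_1.csv')
print(df1_sample.shape)
```

Keeping the folder in one variable means only that line changes when the files move.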

**Step 3**

An overview of the dataframes and the data types of their columns is taken.
Every column has the data type object. Closer inspection shows that the values are strings.

This is a problem because Attribute_1, Attribute_4 and Attribute_5 in df1, and Attribute_6 and Attribute_7 in df2, should hold integer values.
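
One way to do this closer inspection, sketched here on an illustrative frame rather than the real data, is to collect the Python types actually stored in each column. The frame `sample` and the name `value_types` are assumptions made for the example.

```python
import pandas as pd

# Illustrative frame; the values are strings, as in the files described above.
sample = pd.DataFrame({
    'Attribute_1': ['4', '5'],
    'Attribute_4': ['4,510,345', '7,422,378'],
}, dtype=object)

# Collect the Python types actually present in each column.
value_types = {col: set(sample[col].map(type)) for col in sample.columns}
print(value_types)
```

A column that mixes types (for example strings and floats from missing values) would show more than one entry in its set.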

**Step 4**

The relevant columns are selected from df1 and df2.
These are Attribute_1, Attribute_2, Attribute_3 and Attribute_4 from df1.
Attribute_6 and Attribute_7 are selected from df2.

**Step 5**

The column names turn out to be long and untidy, so they are cleaned and shortened to the relevant attribute names.
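
The renaming can be sketched with a single explicit mapping from old to new names. The frame `raw` and the mapping `short_names` below are illustrative; only the column-name pattern is taken from this document.

```python
import pandas as pd

# Illustrative frame with verbose headers like those in the source files.
raw = pd.DataFrame({
    'This is Attribute_1': ['4', '5'],
    'This is Attribute_2': ['Virender', 'Virat'],
})

# One explicit old-name -> new-name mapping, applied in a single call.
short_names = {
    'This is Attribute_1': 'Attribute_1',
    'This is Attribute_2': 'Attribute_2',
}
renamed = raw.rename(columns=short_names)
print(list(renamed.columns))
```

An explicit dictionary does not depend on column order, so a reordered file cannot silently swap two names.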

**Step 6**

Further data cleaning is performed.

In df1, Attribute_4 is of string data type and has commas between digits.

In df2, Attribute_6 is of string data type and has commas between digits.

The commas are removed and the data type is converted from string to numeric (float).

Attribute_1 of df1 and Attribute_7 of df2 are converted from string to numeric data type (float).
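
A slightly more defensive variant of this conversion, which is not what the notebook itself does, uses `pd.to_numeric` with `errors='coerce'`. The sample values, including the unparseable `'n/a'` entry, are hypothetical.

```python
import pandas as pd

# Hypothetical values; 'n/a' is added to show how bad entries are handled.
sample = pd.DataFrame({'Attribute_4': ['4,510,345', '952,394', 'n/a']})

# Remove the thousands separators literally (no regex), then convert.
# errors='coerce' turns unparseable entries into NaN instead of raising,
# which a plain .astype('float') would do.
cleaned = pd.to_numeric(
    sample['Attribute_4'].str.replace(',', '', regex=False),
    errors='coerce',
)
print(cleaned)
```

Whether silent coercion to NaN is preferable to a hard failure depends on the data: coercion keeps the pipeline running, but the resulting NaN values should then be counted and reviewed.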

**Step 7**

The dataframes df1 and df2 are cleaned and ready.

The final dataframe df is created by joining df1 and df2 on their row index, so that it holds all the required attributes.

The data frame df is ready for further graphical and quantitative exploratory data analysis.
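
Because the two frames are combined on their row index, it helps to see how `DataFrame.join` behaves when the indexes do not match exactly. In this small sketch the frames `left` and `right` are illustrative, and `right` is deliberately one row short.

```python
import pandas as pd

left = pd.DataFrame({'Attribute_2': ['Virender', 'Virat', 'Karun']})
right = pd.DataFrame({'Attribute_6': [100897.0, 235456.0]})  # one row short on purpose

# DataFrame.join aligns rows on the index and defaults to a left join,
# so index 2 is kept and its Attribute_6 becomes NaN.
joined = left.join(right)
print(joined)
```

If the two source files are not guaranteed to list subjects in the same order, joining on a shared key column would be safer than relying on row position.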

**Step 8**

Three functions are created. Two of them select the rows that meet the requisite criteria and write them to a CSV file; the third looks up rows by name.
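
The selection logic could also be generalised to any number of thresholds with an optional output path. The helper `select_eligible` below is a hypothetical illustration, not one of the three functions in this notebook; the sample values are taken from the final dataframe shown at the end.

```python
import pandas as pd

def select_eligible(df, thresholds, out_path=None):
    """Keep rows where every listed column is >= its threshold.

    Hypothetical helper for illustration only.
    """
    mask = pd.Series(True, index=df.index)
    for column, minimum in thresholds.items():
        mask &= df[column] >= minimum
    selected = df[mask]
    # Write to disk only when the caller supplies a path.
    if out_path is not None:
        selected.to_csv(out_path)
    return selected

sample = pd.DataFrame({
    'Attribute_4': [4510345.0, 7422378.0, 952394.0],
    'Attribute_7': [23.0, 12.0, 59.0],
})
eligible = select_eligible(sample, {'Attribute_4': 3000000, 'Attribute_7': 20})
print(eligible)
```

Passing the output path as an argument keeps the function usable on machines where the hard-coded folder does not exist.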

In [91]:
# Step 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [122]:
# Step 2
path_file_1 = 'C:\\Users\\Arunank\\Documents\\Data Science\\GitHub\\file_1.csv'
path_file_2 = 'C:\\Users\\Arunank\\Documents\\Data Science\\GitHub\\file_2.csv'
df1_original = pd.read_csv(path_file_1)
df2_original = pd.read_csv(path_file_2)

# Step 3
df1_original.info()
df2_original.info()

# Step 4
col_df1 = ['This is Attribute_1', 'This is Attribute_2', 'This is Attribute_3', 'This is Attribute_4']
df1 = df1_original[col_df1]

col_df2 = ['This is Attribute_6', 'This is Attribute_7']
df2 = df2_original[col_df2]

# Step 5
# Map each verbose column name to its short name in one rename call per frame.
# index=str is kept from the original code; it converts the row labels to strings.
rename_df1 = ['Attribute_1', 'Attribute_2', 'Attribute_3', 'Attribute_4']
df1 = df1.rename(index=str, columns=dict(zip(df1.columns, rename_df1)))

rename_df2 = ['Attribute_6', 'Attribute_7']
df2 = df2.rename(index=str, columns=dict(zip(df2.columns, rename_df2)))

# Step 6
# Remove the thousands separators, then convert the strings to float
df1['Attribute_4'] = df1['Attribute_4'].str.replace(',', '', regex=False).astype('float')
df2['Attribute_6'] = df2['Attribute_6'].str.replace(',', '', regex=False).astype('float')

# These columns are converted directly from string to float
df1['Attribute_1'] = df1['Attribute_1'].astype('float')
df2['Attribute_7'] = df2['Attribute_7'].astype('float')

# Step 7
df = df1.join(df2)
df.index.name = 'Index'

# Step 8
# Defining functions: Eligible_1, Eligible_2, DetailsByName

def Eligible_1(df, attribute_1='Attribute_4', criteria_1=3000000,
               attribute_2='Attribute_7', criteria_2=20):
    # Keep rows that meet both minimum thresholds, then save them to CSV
    list_1 = df[(df[attribute_1] >= criteria_1) &
                (df[attribute_2] >= criteria_2)]
    list_1.to_csv('C:\\Users\\Arunank\\Documents\\Data Science\\GitHub\\eligible_1.csv')
    return list_1

def Eligible_2(df, attribute_1='Attribute_6', criteria_1=300000):
    # Keep rows that meet a single minimum threshold, then save them to CSV
    list_2 = df[df[attribute_1] >= criteria_1]
    list_2.to_csv('C:\\Users\\Arunank\\Documents\\Data Science\\GitHub\\eligible_2.csv')
    return list_2


def DetailsByName(df, name, col_name='Attribute_2'):
    # Return rows whose name column contains the given text;
    # na=False treats missing names as non-matches instead of raising
    return df[df[col_name].str.contains(str(name), na=False)]




In [129]:
Eligible_1(df=df)

| Index | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_6 | Attribute_7 |
|---|---|---|---|---|---|---|
| 0 | 4.0 | Virender | Goal_1 | 4510345.0 | 100897.0 | 23.0 |
| 3 | 4.0 | Rahul | Goal_1 | 4829965.0 | 487345.0 | 45.0 |
| 4 | 5.0 | Mahendra | Goal_4 | 9845789.0 | 495396.0 | 34.0 |


In [130]:
Eligible_2(df=df)

| Index | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_6 | Attribute_7 |
|---|---|---|---|---|---|---|
| 3 | 4.0 | Rahul | Goal_1 | 4829965.0 | 487345.0 | 45.0 |
| 4 | 5.0 | Mahendra | Goal_4 | 9845789.0 | 495396.0 | 34.0 |


In [131]:
DetailsByName(df=df, name='Mahendra')

| Index | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_6 | Attribute_7 |
|---|---|---|---|---|---|---|
| 4 | 5.0 | Mahendra | Goal_4 | 9845789.0 | 495396.0 | 34.0 |


In [132]:
df

| Index | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_6 | Attribute_7 |
|---|---|---|---|---|---|---|
| 0 | 4.0 | Virender | Goal_1 | 4510345.0 | 100897.0 | 23.0 |
| 1 | 5.0 | Virat | Goal_2 | 7422378.0 | 235456.0 | 12.0 |
| 2 | 3.0 | Karun | Goal_3 | 952394.0 | 136985.0 | 59.0 |
| 3 | 4.0 | Rahul | Goal_1 | 4829965.0 | 487345.0 | 45.0 |
| 4 | 5.0 | Mahendra | Goal_4 | 9845789.0 | 495396.0 | 34.0 |
