##### Data Preprocessing
**Import Packages and CSV**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns',None)
df =  pd.read_csv("EasyVisa.csv")
print(df.shape)

(25480, 12)


#### Data Cleaning:
**Steps**

- Handle missing values
- Handle duplicates
- Remove irrelevant features
- Check datatypes
- Understand the dataset

#### 3.1 Check Null Values

In [2]:
features_with_na_values = [features for features in df.columns if df[features].isnull().sum()>=1]
for feat in features_with_na_values:
    print(feat,np.round(df[feat].isnull().mean()*100,5),'% missing values')

In [3]:
features_with_na_values

[]
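The empty list means no column has missing values. For reference, the same check can be written as a single vectorised expression. The sketch below is illustrative only: `toy` is a small made-up frame (not the EasyVisa data) with NaNs planted so the output is non-empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "continent": ["Asia", None, "Europe", "Asia"],
    "prevailing_wage": [592.2, 83425.65, np.nan, np.nan],
    "case_status": ["Denied", "Certified", "Denied", "Certified"],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean().mul(100)

# Keep only the columns that have at least one missing value, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

On a frame with no missing values, `missing_pct` would simply be an empty Series, which matches the empty list above.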

#### 3.2 Check Duplicates

In [4]:
df.duplicated().sum()

np.int64(0)

In [22]:
df[['no_of_employees','yr_of_estab','prevailing_wage']].value_counts()

no_of_employees  yr_of_estab  prevailing_wage
-26              1923         5247.3200          1
                 1954         81982.2700         1
                 1968         168.1558           1
                 1996         37397.0500         1
                 2004         84359.9800         1
                                                ..
 547172          1838         22859.2200         1
 579004          1969         103507.0100        1
 581468          1984         41397.5200         1
 594472          1887         87144.2000         1
 602069          2011         25181.6300         1
Name: count, Length: 25480, dtype: int64

In [20]:
df[(df['yr_of_estab']==2000)&(df['no_of_employees']==802)] # 2000 and 802

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
3043,EZYV3044,North America,Doctorate,Y,N,802,2000,West,109585.81,Year,Y,Certified
3172,EZYV3173,Asia,Bachelor's,N,N,802,2000,Midwest,112499.63,Year,Y,Denied
14306,EZYV14307,Europe,High School,N,N,802,2000,Northeast,24821.87,Year,Y,Denied
19887,EZYV19888,Asia,Bachelor's,N,N,802,2000,South,147219.69,Year,Y,Certified


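**The dataset does not contain any duplicate records.** `df.duplicated().sum()` is 0, and the combination of `no_of_employees`, `yr_of_estab` and `prevailing_wage` is unique across all 25480 rows.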
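One caveat with a full-row duplicate check: if the frame includes a per-row identifier, no two rows can match in every column, so the count is zero by construction. A minimal sketch of checking duplicates with the identifier excluded is shown below. It assumes `case_id` is unique per row (the notebook does not verify this), and `toy` is a made-up frame used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame; rows A1 and A2 are identical apart from case_id.
toy = pd.DataFrame({
    "case_id": ["A1", "A2", "A3"],
    "no_of_employees": [802, 802, 98],
    "yr_of_estab": [2000, 2000, 1897],
    "prevailing_wage": [109585.81, 109585.81, 83434.03],
})

# With the unique identifier included, no row is a full duplicate.
full_dupes = int(toy.duplicated().sum())

# Excluding the identifier reveals rows that repeat in every other column.
cols = toy.columns.drop("case_id").tolist()
dupes_without_id = int(toy.duplicated(subset=cols).sum())

print(full_dupes, dupes_without_id)
```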

#### 3.3 Remove Irrelevant Features
- `case_id`: this feature is not required for model building.

In [25]:
df2 = df.drop(columns=['case_id'])  # columns= drops by column label, so no axis argument is needed
df2

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,Certified
2,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,Denied
3,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,Denied
4,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...,...,...
25475,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,Certified
25476,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,Certified
25477,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,Certified
25478,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,Certified


#### 3.4 Check DataTypes

In [26]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   continent              25480 non-null  object 
 1   education_of_employee  25480 non-null  object 
 2   has_job_experience     25480 non-null  object 
 3   requires_job_training  25480 non-null  object 
 4   no_of_employees        25480 non-null  int64  
 5   yr_of_estab            25480 non-null  int64  
 6   region_of_employment   25480 non-null  object 
 7   prevailing_wage        25480 non-null  float64
 8   unit_of_wage           25480 non-null  object 
 9   full_time_position     25480 non-null  object 
 10  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 2.1+ MB


- The dataset contains no null records, and the datatypes are correct.
- There are one float, two int, and eight object columns.
- That gives three numerical features and eight categorical features.
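The numerical/categorical split can also be derived programmatically instead of being read off `info()`. The sketch below uses `select_dtypes` on a two-row frame built from a subset of the columns shown above; the variable names `numeric_features` and `categorical_features` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Two rows and five columns copied from the output above, for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", "Africa"],
    "no_of_employees": [14513, 1082],
    "yr_of_estab": [2007, 2005],
    "prevailing_wage": [592.2029, 149907.39],
    "case_status": ["Denied", "Certified"],
})

# int and float columns count as numerical; everything else as categorical.
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_features), "numerical:", numeric_features)
print(len(categorical_features), "categorical:", categorical_features)
```

Applied to `df2`, the same two calls should reproduce the three numerical and eight categorical features listed above.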

#### 3.5 Understand Dataset
- Descriptive Stats Understanding

In [27]:
df2.describe()

Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage
count,25480.0,25480.0,25480.0
mean,5667.04321,1979.409929,74455.814592
std,22877.928848,42.366929,52815.942327
min,-26.0,1800.0,2.1367
25%,1022.0,1976.0,34015.48
50%,2109.0,1997.0,70308.21
75%,3504.0,2005.0,107735.5125
max,602069.0,2016.0,319210.27


In [37]:
df2[df2['no_of_employees']<1].shape[0]

33

In [40]:
df2[df2['yr_of_estab']>2005]['yr_of_estab'].value_counts()

yr_of_estab
2007    994
2006    844
2010    743
2008    674
2009    640
2013    533
2011    518
2012    492
2014    175
2015     64
2016     23
Name: count, dtype: int64

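**No. of employees**
- `no_of_employees` has a mean of 5667 and a median of 2109.
- The standard deviation is 22877, roughly four times the mean, which indicates a very wide spread.
- Half of the records have at most 2109 employees, so the bulk of the data sits far below the mean.
- The maximum is 602069 and the minimum is -26. The maximum points to extreme outliers, and 33 records have fewer than one employee, which is invalid. Both need to be handled.
- The mean is well above the median (5667 vs 2109), which indicates a right-skewed distribution.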
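The observations above can be checked numerically. The sketch below uses a handful of `no_of_employees` values that appear in the outputs above, not the full column. Taking the absolute value is only one possible repair, and it assumes the negative sign is a data-entry error, which the notebook does not establish.

```python
import pandas as pd

# A few values seen in the outputs above; the real column has 25480 entries.
employees = pd.Series([-26, 98, 1082, 2412, 14513, 602069], name="no_of_employees")

# Head counts below 1 cannot be real, so flag them separately from
# large-but-plausible outliers.
invalid_mask = employees < 1
n_invalid = int(invalid_mask.sum())

# Positive sample skewness indicates a long right tail.
skewness = employees.skew()

# One possible repair (an assumption): treat the sign as a data-entry error.
repaired = employees.abs()

print(n_invalid, skewness > 0, repaired.min())
```

Dropping the invalid rows or imputing them are equally reasonable alternatives; the right choice depends on how the values arose.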

**Year of Establishment**
- The establishment years range from 1800 to 2016.
- 