# Data Understanding 




In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession 

In [3]:
# The session is named `spark` because later cells call spark.read and spark.sql.
spark = SparkSession.builder \
.master("local[4]")\
.appName("DataPreprocessing")\
.config("spark.executor.memory","3g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

In [4]:
sc = spark.sparkContext

# --> Dataset reading and checking

The Adult dataset contains information about people. The target is their income, labelled as either higher than 50K or at most 50K.

#### Loading train dataset

In [5]:
train_data_df = spark.read\
.option("header", "True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.csv("data/adult.data")

#### Loading test dataset

In [6]:
test_data_df = spark.read\
.option("header", "True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.csv("data/adult.test")

In [7]:
train_data_df.show(5)

+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|age|        workclass|  fnlwgt| education|education_num|     marital_status|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_per_week|native_country|output|
+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
| 39|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|          40.0| United-States| <=50K|
| 50| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|         0.0|         0.0|          13.0| United-States| <=50K|
| 38|          Private|215646.0|   HS-grad|       

### For a more readable display, the dataset is converted to a pandas DataFrame

In [8]:
train_data_df.limit(5).toPandas().head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,output
0,39,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


#### Combining the train and test datasets

In [9]:
whole_df = train_data_df.union(test_data_df)
whole_df.limit(5).toPandas().head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,output
0,39,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


#### Checking consistency of row counts

In [10]:
print("Train set  row number: ", train_data_df.count())
print("Test  set  row number: ", test_data_df.count())
print("Whole set  row number: ", whole_df.count())

Train set  row number:  32561
Test  set  row number:  16281
Whole set  row number:  48842
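
Spark's `union` keeps duplicate rows (it behaves like SQL `UNION ALL`), so the combined count should equal the sum of the two parts. A minimal sketch of that check as an explicit assertion, using the counts printed above; the helper name `union_is_complete` is hypothetical, and in the notebook the three arguments would come from the `count()` calls.

```python
# Row counts copied from the output above.
train_rows = 32561
test_rows = 16281
whole_rows = 48842


def union_is_complete(train, test, whole):
    """Return True when the combined row count equals the sum of the parts."""
    return train + test == whole


# Fails loudly if rows were lost or duplicated while combining the sets.
assert union_is_complete(train_rows, test_rows, whole_rows)
print(union_is_complete(train_rows, test_rows, whole_rows))
```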


### --> Analysing the dataset and comparing it with the schema
Inspecting the data type of each column (string, integer, double)

In [11]:
whole_df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- output: string (nullable = true)



### --> Describing statistics of numeric values
Count, mean, standard deviation, min and max values

In [12]:
whole_df.describe(["age","fnlwgt","education_num","capital_gain","capital_loss","hours_per_week"])\
.toPandas().head()

Unnamed: 0,summary,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
1,mean,38.64358543876172,189664.13459727284,10.078088530363212,1079.0676262233324,87.50231358257237,40.422382375824085
2,stddev,13.710509934443564,105604.02542315732,2.570972755592263,7452.019057655401,403.0045521243599,12.3914440242523
3,min,17.0,12285.0,1.0,0.0,0.0,1.0
4,max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


### --> Describing statistics of categorical values
Using groupBy() gives more information about categorical variables. The loaded dataset has 9 categorical variables: workclass, education, occupation, relationship, marital_status, race, sex, native_country and output.
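
The list of categorical variables can be cross-checked against the schema rather than typed by hand. A minimal sketch, assuming the (name, type) pairs below are copied from the `printSchema()` output above and that every string column is categorical; in the notebook, `whole_df.dtypes` returns pairs of the same shape.

```python
# (column name, Spark type) pairs, copied from the printSchema() output.
schema = [
    ("age", "int"), ("workclass", "string"), ("fnlwgt", "double"),
    ("education", "string"), ("education_num", "double"),
    ("marital_status", "string"), ("occupation", "string"),
    ("relationship", "string"), ("race", "string"), ("sex", "string"),
    ("capital_gain", "double"), ("capital_loss", "double"),
    ("hours_per_week", "double"), ("native_country", "string"),
    ("output", "string"),
]

# Assumption: string-typed columns are the categorical ones.
categorical_cols = [name for name, dtype in schema if dtype == "string"]
numeric_cols = [name for name, dtype in schema if dtype != "string"]

print(len(categorical_cols), categorical_cols)
print(len(numeric_cols), numeric_cols)
```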

### 1. workclass
Interpretation: No issue is visible in the categories shown. Note that head() returns only the first five groups.

In [13]:
from pyspark.sql import functions as F

In [14]:
whole_df.groupBy(F.col("workclass")).agg({"*":"count"}).toPandas().head()

Unnamed: 0,workclass,count(1)
0,State-gov,1981
1,Federal-gov,1432
2,Self-emp-not-inc,3862
3,Local-gov,3136
4,Private,33906


### Creating TempView and grouping by SQL query

In [15]:
whole_df.createOrReplaceTempView("workclassTable")

In [16]:
spark.sql("""
    SELECT workclass, COUNT(*) FROM workclassTable
    GROUP BY workclass 
    LIMIT 10
""").toPandas().head()

Unnamed: 0,workclass,count(1)
0,State-gov,1981
1,Federal-gov,1432
2,Self-emp-not-inc,3862
3,Local-gov,3136
4,Private,33906


### 2. education
Interpretation: There are no outlier values, but some categories could be merged.
For example, 7th-8th could be merged into a primary-school category.

In [17]:
whole_df.groupBy(F.col("education")).agg({"*":"count"}).toPandas().head(20)

Unnamed: 0,education,count(1)
0,Prof-school,834
1,10th,1389
2,7th-8th,955
3,5th-6th,509
4,Assoc-acdm,1601
5,Assoc-voc,2061
6,Masters,2657
7,12th,657
8,Preschool,83
9,9th,756
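
The merge suggested in the interpretation can be sketched as a lookup table applied to the counts shown above. The group names (`Primary`, `HS-incomplete`) and the choice of which levels belong together are assumptions for illustration only, and only the education levels visible in the output are included.

```python
from collections import Counter

# Counts copied from the groupBy output above (only the rows shown there).
education_counts = {
    "Prof-school": 834, "10th": 1389, "7th-8th": 955, "5th-6th": 509,
    "Assoc-acdm": 1601, "Assoc-voc": 2061, "Masters": 2657,
    "12th": 657, "Preschool": 83, "9th": 756,
}

# Hypothetical grouping: names and members are illustrative, not prescribed.
merge_map = {
    "Preschool": "Primary", "5th-6th": "Primary", "7th-8th": "Primary",
    "9th": "HS-incomplete", "10th": "HS-incomplete", "12th": "HS-incomplete",
}

merged = Counter()
for level, count in education_counts.items():
    # Levels without an entry in merge_map keep their original name.
    merged[merge_map.get(level, level)] += count

print(dict(merged))
```

Merging only relabels rows, so the total count before and after should be identical.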


### 3. marital_status
Interpretation: There is no issue except for one sparse category (Married-AF-spouse, with 37 rows).

In [18]:
whole_df.groupBy(F.col("marital_status")).agg({"*":"count"}).toPandas().head(20)

Unnamed: 0,marital_status,count(1)
0,Widowed,1518
1,Married-spouse-absent,628
2,Married-AF-spouse,37
3,Married-civ-spouse,22379
4,Divorced,6633
5,Never-married,16117
6,Separated,1530
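
A sparse category can be flagged by its share of all rows instead of by eye. A minimal sketch using the counts above; the 1% cutoff is an assumed value, not one taken from the notebook.

```python
# Counts copied from the groupBy output above.
marital_counts = {
    "Widowed": 1518, "Married-spouse-absent": 628, "Married-AF-spouse": 37,
    "Married-civ-spouse": 22379, "Divorced": 6633, "Never-married": 16117,
    "Separated": 1530,
}

THRESHOLD = 0.01  # assumed cutoff: categories below 1% of rows count as sparse

total = sum(marital_counts.values())
sparse = [name for name, count in marital_counts.items()
          if count / total < THRESHOLD]

print(total, sparse)
```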


### 4. occupation
Interpretation: There is a ? occupation, which means unknown, and one sparse category with only 15 people.

In [19]:
whole_df.groupBy(F.col("occupation")).agg({"*":"count"}).toPandas().head(20)

Unnamed: 0,occupation,count(1)
0,Farming-fishing,1490
1,Handlers-cleaners,2072
2,Prof-specialty,6172
3,Adm-clerical,5611
4,Exec-managerial,6086
5,Craft-repair,6112
6,Sales,5504
7,?,2809
8,Tech-support,1446
9,Transport-moving,2355
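
One way to handle the `?` placeholder is to map it to a missing value before any further processing. The sketch below shows the rule on plain Python strings; the function name `normalize_category` is hypothetical, and stripping whitespace is an assumption that the raw values may carry padding around them.

```python
def normalize_category(value):
    """Strip surrounding whitespace and map the '?' placeholder to None."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned == "?" else cleaned


# Illustrative inputs, not rows taken from the dataset.
samples = [" Sales", " ?", "Tech-support", None]
print([normalize_category(v) for v in samples])
```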


### 5. relationship
Interpretation: No issue is visible; the categories are consistent.

In [20]:
whole_df.groupBy(F.col("relationship")).agg({"*":"count"}).toPandas().head(20)

Unnamed: 0,relationship,count(1)
0,Husband,19716
1,Own-child,7581
2,Not-in-family,12583
3,Other-relative,1506
4,Wife,2331
5,Unmarried,5125


### 6. race 
Interpretation: No issue is visible; the categories are consistent.

In [21]:
whole_df.groupBy(F.col("race")).agg({"*":"count"}).toPandas().head()

Unnamed: 0,race,count(1)
0,Asian-Pac-Islander,1519
1,Black,4685
2,Other,406
3,White,41762
4,Amer-Indian-Eskimo,470


### 7. sex (gender)

In [22]:
whole_df.groupBy(F.col("sex")).agg({"*":"count"}).toPandas().head()

Unnamed: 0,sex,count(1)
0,Male,32650
1,Female,16192


### 8. native_country
Interpretation: Only one person is from Holand-Netherlands. Most people are from the United States.

In [23]:
whole_df.groupBy(F.col("native_country")).agg({"*":"count"}).toPandas().head(50)

Unnamed: 0,native_country,count(1)
0,Dominican-Republic,103
1,Ireland,37
2,Cuba,138
3,Guatemala,88
4,Iran,59
5,Taiwan,65
6,El-Salvador,155
7,United-States,43832
8,South,115
9,Japan,92


### 9. output
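Interpretation: The labels themselves are valid, but >50K. and <=50K. carry a trailing period. Their combined count (3846 + 12435 = 16281) equals the test set row count, which suggests they come from the test file. They need to be cleaned by removing the period.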

In [24]:
whole_df.groupBy(F.col("output")).agg({"*":"count"}).show()

+-------+--------+
| output|count(1)|
+-------+--------+
|   >50K|    7841|
|  >50K.|    3846|
| <=50K.|   12435|
|  <=50K|   24720|
+-------+--------+
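
The cleanup described above amounts to stripping the trailing period and adding up the counts that then share a label. A minimal sketch on the counts from the output; on the Spark DataFrame itself, a column expression such as `F.regexp_replace` could likely do the same job, which is an assumption rather than something this notebook shows.

```python
from collections import Counter

# Counts copied from the show() output above.
label_counts = {">50K": 7841, ">50K.": 3846, "<=50K.": 12435, "<=50K": 24720}

cleaned = Counter()
for label, count in label_counts.items():
    # Drop padding and the trailing period so both spellings map to one label.
    cleaned[label.strip().rstrip(".")] += count

print(dict(cleaned))
```

After cleaning, only two labels should remain and their counts should still add up to the row count of the whole set.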

