
# **Labs 1 and 2: PySpark**

In these labs we will use the "[[NeurIPS 2020] Data Science for COVID-19 (DS4C)](https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset?select=PatientInfo.csv)" dataset, retrieved from [Kaggle](https://www.kaggle.com/) on 1/6/2022 for educational, non-commercial purposes. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

The CSV file used in these labs is **PatientInfo.csv**.

## PatientInfo.csv

**patient_id**
the ID of the patient

**sex**
the sex of the patient

**age**
the age of the patient

**country**
the country of the patient

**province**
the province of the patient

**city**
the city of the patient

**infection_case**
the case of infection

**infected_by**
the ID of who infected the patient


**contact_number**
the number of contacts with people

**symptom_onset_date**
the date of symptom onset

**confirmed_date**
the date of being confirmed

**released_date**
the date of being released

**deceased_date**
the date of being deceased

**state**
isolated / released / deceased

### Install pyspark and check its version

In [None]:
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=949fc4be7df6b7475718e927721aef103152653ab9c8055a5e3feb96e30e1a8f
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


### Import and create SparkSession

In [None]:
import pyspark

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.getOrCreate()

### Load the PatientInfo.csv file and show the first 5 rows

In [None]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [None]:
df = spark.read.csv("/content/PatientInfo.csv", header=True, inferSchema=True)
df.head(5)


### Display the schema of the dataset

In [None]:
df.printSchema()

### Display the statistical summary

In [None]:
df.summary().show()

### Using the state column, how many people survived (released), and how many didn't survive (isolated/deceased)?

In [None]:
df.groupBy("state").count().show()

### Display the number of null values in each column

In [None]:
from pyspark.sql.functions import col,isnan,when,count
df.select([count(when(col(c).isNull(),c)).alias(c)
                    for c in df.columns]).show()

## Data preprocessing

### Fill the nulls in the deceased_date with the released_date.
- You can use the <b>coalesce</b> function

In [None]:
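from pyspark.sql.functions import coalesce
# coalesce returns the first non-null argument, so deceased_date must come first:
# it is kept where present and replaced by released_date only where it is null
df = df.withColumn('deceased_date', coalesce(df["deceased_date"],
                                             df["released_date"]))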

### Add a column named no_days, which is the difference between the deceased_date and the confirmed_date, then show the top 5 rows and print the schema.
- <b>Hint: You need to typecast these columns as date first</b>

In [None]:
from pyspark.sql.types import DateType

In [None]:
df = df.withColumn("deceased_date",col("deceased_date").cast(DateType()))
df = df.withColumn("confirmed_date",col("confirmed_date").cast(DateType()))

In [None]:
df.select("deceased_date").show()

In [None]:
from pyspark.sql.functions import datediff
# Number of days from confirmation to the (filled) deceased_date
df = df.withColumn("no_days", datediff(col("deceased_date"), col("confirmed_date")))
df.show(5)
df.printSchema()

### Add an is_male column: true if the patient is male, otherwise false

In [None]:
from pyspark.sql.functions import when
df = df.withColumn("is_male",when(col("sex") == 'male','true').otherwise('false'))
df.show()

### Add an is_dead column: true if the patient's state is not released, otherwise false

- Use a <b>UDF</b> to perform this task.
- However, a UDF is recommended only when no built-in function can do the required operation.
- A UDF is slower than built-in functions.

In [None]:
def statrel(stat):
    # True for any state other than 'released' (isolated, deceased, or missing)
    return stat != 'released'

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
# statrel returns a Python bool, so declare the UDF's return type as BooleanType
convertUDF = udf(statrel, BooleanType())

In [None]:
df = df.withColumn("is_dead",convertUDF(col("state")))
df.show()

### Change the age bins from strings (0s, 10s, 20s, etc.) to numbers (0, 10, 20, etc.)

In [None]:
from pyspark.sql.functions import translate
from pyspark.sql.types import IntegerType
df = df.withColumn('age', translate('age','s',''))
df.show()

### Cast age and no_days to Double

In [None]:
from pyspark.sql.types import IntegerType , DoubleType
df = df.withColumn('age', col('age').cast(DoubleType()))
df = df.withColumn('no_days', col('no_days').cast(DoubleType()))
df.show()

### Drop the columns
["patient_id","sex","infected_by","contact_number","released_date","state",
"symptom_onset_date","confirmed_date","deceased_date","country","no_days",
"city","infection_case"]

In [None]:
cols = (
    "patient_id", "sex", "infected_by", "contact_number", "released_date", "state",
    "symptom_onset_date", "confirmed_date", "deceased_date", "country", "no_days",
    "city", "infection_case",
)
df = df.drop(*cols)
df.show()

### Recount the number of nulls now

In [None]:
df.select([count(when(col(c).isNull(),c)).alias(c)
                    for c in df.columns])\
                    .show()

## Now do the same using SQL SELECT statements

### From the original Patient DataFrame, Create a temporary view (table).

### Use SELECT statement to select all columns from the dataframe and show the output.

### *Using SQL commands*, limit the output to only 5 rows 

### Select the count of males and females in the dataset

### How many people survived, and how many didn't?

### Now, let's perform some preprocessing using SQL:
1. Convert *age* column to double after removing the 's' at the end -- *hint: check SUBSTRING method*
2. Select only the following columns: `['sex', 'age', 'province', 'state']`
3. Store the result of the query in a new dataframe

## Machine Learning 
### Create a pipeline model to predict is_dead and evaluate the performance.
- Use <b>StringIndexer</b> to transform <b>string</b> data type to indices.
- Use <b>OneHotEncoder</b> to deal with categorical values.
- Use <b>Imputer</b> to fill missing data with mean.

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

In [None]:
# indexer = StringIndexer().setInputCol("is_dead").setOutputCol("is_dead1")
# df = indexer.fit(df).transform(df)
# df.show()

In [None]:
# df = StringIndexer().setInputCol("is_male").setOutputCol("is_male1").fit(df).transform(df)
# df.show()

In [None]:
# df = StringIndexer().setInputCol("province").setOutputCol("province1").fit(df).transform(df)
# df.show()

In [None]:
# cols = ("province","is_male",'is_dead')
# df = df.drop(*cols)
# df.show()

In [None]:
cols = ["province","is_male"]

indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]

encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol())) 
    for indexer in indexers
]

assembler = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders],
    outputCol="features"
)


pipeline = Pipeline(stages=indexers + encoders + [assembler])
pipeline.fit(df).transform(df).show()