# Wine

### Introduction:

This exercise is an adaptation of the UCI Wine dataset.
Its only purpose is to practice deleting data, here with PySpark.

### Step 1. Import the necessary libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("wine").getOrCreate()
spark

### Step 2. Import the dataset from this [address](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data). 

In [2]:
from pyspark import SparkFiles

### Step 3. Assign it to a variable called wine

In [5]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

spark.sparkContext.addFile(url)

wine = spark.read.csv(SparkFiles.get("wine.data"), header=False, inferSchema=True, sep=',')

In [6]:
wine.show(5)

+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9|_c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|  1|14.23|1.71|2.43|15.6|127| 2.8|3.06|0.28|2.29|5.64|1.04|3.92|1065|
|  1| 13.2|1.78|2.14|11.2|100|2.65|2.76|0.26|1.28|4.38|1.05| 3.4|1050|
|  1|13.16|2.36|2.67|18.6|101| 2.8|3.24| 0.3|2.81|5.68|1.03|3.17|1185|
|  1|14.37|1.95| 2.5|16.8|113|3.85|3.49|0.24|2.18| 7.8|0.86|3.45|1480|
|  1|13.24|2.59|2.87|21.0|118| 2.8|2.69|0.39|1.82|4.32|1.04|2.93| 735|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
only showing top 5 rows



### Step 4. Delete the first, fourth, seventh, ninth, eleventh, thirteenth and fourteenth columns

In [8]:
cols_to_drop = ["_c0", "_c3","_c6", "_c8", "_c10", "_c12", "_c13"]
wine = wine.drop(*cols_to_drop)
wine.show(5)

+-----+----+----+---+----+----+----+
|  _c1| _c2| _c4|_c5| _c7| _c9|_c11|
+-----+----+----+---+----+----+----+
|14.23|1.71|15.6|127|3.06|2.29|1.04|
| 13.2|1.78|11.2|100|2.76|1.28|1.05|
|13.16|2.36|18.6|101|3.24|2.81|1.03|
|14.37|1.95|16.8|113|3.49|2.18|0.86|
|13.24|2.59|21.0|118|2.69|1.82|1.04|
+-----+----+----+---+----+----+----+
only showing top 5 rows



### Step 5. Assign the columns as below:

The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):  
1) alcohol  
2) malic_acid  
3) alcalinity_of_ash  
4) magnesium  
5) flavanoids  
6) proanthocyanins  
7) hue 

In [9]:
new_names = [ "alcohol",
             "malic_acid",
             "alcalinity_of_ash",
             "magnesium",
             "flavanoids",
             "proanthocyanins",
             "hue" ]

In [12]:
wine = wine.toDF(*new_names)
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|
+-------+----------+-----------------+---------+----------+---------------+----+
|  14.23|      1.71|             15.6|      127|      3.06|           2.29|1.04|
|   13.2|      1.78|             11.2|      100|      2.76|           1.28|1.05|
|  13.16|      2.36|             18.6|      101|      3.24|           2.81|1.03|
|  14.37|      1.95|             16.8|      113|      3.49|           2.18|0.86|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|
+-------+----------+-----------------+---------+----------+---------------+----+
only showing top 5 rows



### Step 6. Set the values of the first 3 rows of alcohol to NaN

In [16]:
#trying to find an answer

### Step 7. Now set the values of rows 3 and 4 of magnesium to NaN

In [17]:
#trying to find an answer

### Step 8. Replace NaN with 10 in alcohol and 100 in magnesium

In [18]:
#trying to find an answer

### Step 9. Count the number of missing values

In [13]:
import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """
    Count the nulls and NaNs in each column that is not a timestamp, string or date.
    """
    # isnan() does not accept timestamp, string or date columns, so skip them
    checked_cols = [c for (c, c_type) in spark_df.dtypes
                    if c_type not in ('timestamp', 'string', 'date')]
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for c in checked_cols]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        # One row per column, largest count first
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df

In [14]:
count_missings(wine)

| | count |
|---|---|
| alcohol | 0 |
| malic_acid | 0 |
| alcalinity_of_ash | 0 |
| magnesium | 0 |
| flavanoids | 0 |
| proanthocyanins | 0 |
| hue | 0 |


### Step 10. Create an array of 10 random numbers up to 10

In [19]:
#trying to find an answer
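A small sketch using only the standard library. It assumes "up until 10" means integers from 0 to 9 inclusive, and it fixes a seed (an addition not in the exercise) so repeated runs give the same values. The name `random_index` is chosen here for illustration.

```python
import random

# Fixed seed for repeatability; this is an assumption, not part of the exercise.
rng = random.Random(42)

# Ten integers drawn from 0..9; duplicates are possible.
random_index = [rng.randrange(10) for _ in range(10)]
print(random_index)
```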

### Step 11. Use the random numbers you generated as an index and assign a NaN value to each corresponding cell.

In [20]:
#trying to find an answer

### Step 12.  How many missing values do we have?

In [21]:
#trying to find an answer

### Step 13. Delete the rows that contain missing values

In [15]:
wine.dropna().show(5)

+-------+----------+-----------------+---------+----------+---------------+----+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|
+-------+----------+-----------------+---------+----------+---------------+----+
|  14.23|      1.71|             15.6|      127|      3.06|           2.29|1.04|
|   13.2|      1.78|             11.2|      100|      2.76|           1.28|1.05|
|  13.16|      2.36|             18.6|      101|      3.24|           2.81|1.03|
|  14.37|      1.95|             16.8|      113|      3.49|           2.18|0.86|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|
+-------+----------+-----------------+---------+----------+---------------+----+
only showing top 5 rows



### Step 14. Print only the non-null values in alcohol

In [22]:
#trying to find an answer

### Step 15.  Reset the index, so it starts with 0 again

In [23]:
#trying to find an answer

### BONUS: Create your own question and answer it.

In [24]:
#trying to find an answer