# NumPy, Pandas & Random Forest

##### 2 Important Modules

* Most stats, data science, and machine learning work in Python relies on both Pandas and NumPy.
* Many of the statistical programming/data processing operations we do in R have counterparts in these two modules.
* Implementations of many common algorithms and statistical models in Python take NumPy or Pandas objects as inputs.

### 1. NumPy

* Describes itself as "the fundamental package for scientific computing"
* Along with the objects it provides, it also includes "fast" (vectorized) mathematical operations that we do not get with built-in Python

In [1]:
# ! pip3 install numpy

In [2]:
import numpy as np
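To make the "vectorized" point concrete, here is a minimal sketch contrasting a plain Python loop with the equivalent NumPy operation. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

values = [4, 5, 6, 7]

# Built-in Python: a list has no elementwise arithmetic, so we loop explicitly
# (values * 2 would repeat the list instead of doubling each element)
doubled_list = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element in one call
doubled_array = np.array(values) * 2

print(doubled_list)
print(doubled_array)
```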

#### 1.1 Convenient math operations

##### Descriptive stats

In [3]:
testlist = [4,5,6,7]

In [4]:
float(np.mean(testlist))

5.5

##### Things we couldn't do before

In [5]:
int(1.9999)

1

In [6]:
int(np.round(1.9999))

2
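The difference above comes from truncation versus rounding: `int()` drops the fractional part, while `np.round` goes to the nearest integer. A small sketch on an array (values chosen for illustration) shows both, including how NumPy handles exact `.5` ties.

```python
import numpy as np

x = np.array([1.9999, -1.9999, 2.5, 3.5])

# Casting to int drops the fractional part (truncation toward zero)
truncated = x.astype(int)

# np.round goes to the nearest integer; exact .5 ties go to the even neighbour
rounded = np.round(x)

print(truncated)
print(rounded)
```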

#### 1.2 NumPy arrays

- NumPy's main object is the multidimensional array.
    - These are faster/more efficient than lists, and we can perform math (including matrix operations) on them.
- It is a table of elements, usually numbers, all of the same type, indexed by a tuple of non-negative integers. An array's total size is fixed when it is created, although `reshape` can rearrange the same elements into a different shape.
- NumPy’s array class is called `ndarray`. It is also known by the alias `array`.
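The description above can be checked directly on a small array. This is only a sketch; `grid` is a made-up example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

n_dims = grid.ndim        # number of axes
shape = grid.shape        # length of each axis
element = grid[1, 2]      # row 1, column 2, addressed by a tuple of integers
kind = grid.dtype.kind    # 'i' means every element is a signed integer

print(n_dims, shape, element, kind)
```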

##### Creating arrays

* From list

In [7]:
a = np.array([5, 10, 7, 2])
a

array([ 5, 10,  7,  2])

In [8]:
type(a)

numpy.ndarray

* From tuple

In [9]:
a = np.array((5, 10, 7, 2))
a

array([ 5, 10,  7,  2])

In [10]:
type(a)

numpy.ndarray

##### Array attributes

In [11]:
# dir(a)

In [12]:
a.dtype

dtype('int64')

In [13]:
a.sum()

np.int64(24)

In [14]:
a.shape

(4,)

In [15]:
a.size

4

In [16]:
a[0]

np.int64(5)

##### Integer vs. Float arrays

In [17]:
b = np.array([0.8, 7.66, 9.2, 1.76])

In [18]:
b.dtype

dtype('float64')

In [19]:
b.round()

array([1., 8., 9., 2.])

##### Recall that math operations are now vectorized
* We can expect them to apply elementwise for arrays of the same dimensions

In [20]:
a * a

array([ 25, 100,  49,   4])

In [24]:
a - b

array([ 4.2 ,  2.34, -2.2 ,  0.24])

In [25]:
a / b

array([6.25      , 1.30548303, 0.76086957, 1.13636364])
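Elementwise operations need compatible shapes. The sketch below (with a hypothetical `short` array) shows that a scalar is applied to every element, while two arrays of different lengths raise an error.

```python
import numpy as np

a = np.array([5, 10, 7, 2])
short = np.array([1, 2, 3])

# A scalar is applied to every element
shifted = a + 1

# Arrays of different lengths (4 vs 3) cannot be combined elementwise
try:
    a + short
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(shifted, mismatch_failed)
```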

##### Mixing integers and floats

In [26]:
remix = np.array([8, 7.2, 0, 9])

In [27]:
remix.dtype

dtype('float64')

In [28]:
result = a * remix

In [29]:
result

array([40., 72.,  0., 18.])

In [30]:
result.astype('int64')

array([40, 72,  0, 18])

* What is this doing?

In [31]:
np.dot(a,b)

np.float64(148.51999999999998)

In [32]:
(a[0]* b[0]) + (a[1] * b[1]) + (a[2]* b[2]) + (a[3] * b[3])

np.float64(148.52)
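For 1-D arrays, `np.dot` is the dot product: multiply elementwise, then sum. The sketch below checks three equivalent spellings and compares them with a tolerance, since floating-point sums can differ in the last digits (as the two outputs above do).

```python
import numpy as np

a = np.array([5, 10, 7, 2])
b = np.array([0.8, 7.66, 9.2, 1.76])

dot_result = np.dot(a, b)
manual_result = (a * b).sum()    # multiply elementwise, then add up
operator_result = a @ b          # the @ operator is equivalent for 1-D arrays

# Compare with a tolerance rather than ==
same = bool(np.isclose(dot_result, manual_result)
            and np.isclose(dot_result, operator_result))
print(same)
```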

##### Arrays of mixed elements

In [33]:
mix = np.array([5, "cool!", 7.5, True])

* What do you notice about the types of elements in `mix`?

In [34]:
mix

array(['5', 'cool!', '7.5', 'True'], dtype='<U32')

In [35]:
mix.dtype

dtype('<U32')

In [36]:
type(mix)

numpy.ndarray

In [37]:
mix + mix

array(['55', 'cool!cool!', '7.57.5', 'TrueTrue'], dtype='<U64')
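Because an array holds a single type, NumPy converts mixed inputs to a common one, which is why everything in `mix` became a string. A rough sketch of two related cases: purely numeric inputs are promoted to floats, and `dtype=object` keeps the original Python objects (at the cost of the fast vectorized math).

```python
import numpy as np

# Numbers and booleans share a common numeric type, so they become floats
numeric = np.array([5, 7.5, True])

# dtype=object stores references to the original Python objects instead
objects = np.array([5, "cool!", 7.5, True], dtype=object)
element_types = [type(item).__name__ for item in objects]

print(numeric, numeric.dtype)
print(element_types)
```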

#### 1.3 2-D Arrays

In [38]:
a2 = np.array([5, 10, 7, 2]).reshape(2,2)

In [39]:
a2

array([[ 5, 10],
       [ 7,  2]])

In [40]:
b2 = b.reshape(2,2)

In [41]:
b2

array([[0.8 , 7.66],
       [9.2 , 1.76]])

##### Newly relevant attributes and methods

In [42]:
a2.shape

(2, 2)

In [43]:
a2.size

4

In [44]:
a2.ravel()

array([ 5, 10,  7,  2])

In [45]:
a2.ravel().shape

(4,)

In [46]:
a2.T

array([[ 5,  7],
       [10,  2]])

In [47]:
a2.trace()

np.int64(7)

In [48]:
a2.diagonal()

array([5, 2])
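Two small relationships tie these together. The sketch below (using a made-up `flat` array) shows that `reshape` can infer one dimension when given `-1`, and that the trace is the sum of the diagonal.

```python
import numpy as np

flat = np.arange(6)
inferred = flat.reshape(-1, 2)   # -1 asks NumPy to work out the row count

a2 = np.array([[5, 10], [7, 2]])
trace_matches = bool(a2.trace() == a2.diagonal().sum())

print(inferred.shape, trace_matches)
```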

##### `arange` and `linspace` Functions

In [49]:
c = np.arange(16).reshape(4,4)
c

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [50]:
c2 = np.arange(2,16,4).reshape(2,2)
c2

array([[ 2,  6],
       [10, 14]])

In [51]:
c22 = np.arange(0,1,0.25).reshape(2,2)
c22

array([[0.  , 0.25],
       [0.5 , 0.75]])

In [52]:
c3 = np.linspace(0,1,4).reshape(2,2)
c3

array([[0.        , 0.33333333],
       [0.66666667, 1.        ]])

In [53]:
c32 = np.linspace(0,4,9).reshape(3,3)
c32

array([[0. , 0.5, 1. ],
       [1.5, 2. , 2.5],
       [3. , 3.5, 4. ]])

* The `logspace()`, `zeros()`, and `ones()` functions can also be useful for creating arrays.
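As a quick comparison of these constructors, here is a minimal sketch. Note that `arange` takes a step and excludes the stop value, while `linspace` takes a count and includes it.

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)    # fixed step, stop value excluded
spaced = np.linspace(0, 1, 5)      # fixed count, stop value included

zeros = np.zeros((2, 3))           # 2x3 array of 0.0
ones = np.ones(4)                  # length-4 array of 1.0
powers = np.logspace(0, 2, 3)      # 10**0, 10**1, 10**2

print(stepped, spaced, powers)
```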

##### Stacking arrays

In [54]:
cmix = np.vstack([a2,b2])
cmix

array([[ 5.  , 10.  ],
       [ 7.  ,  2.  ],
       [ 0.8 ,  7.66],
       [ 9.2 ,  1.76]])

In [55]:
cmix.shape

(4, 2)

In [56]:
dmix = np.hstack([a2,b2])
dmix

array([[ 5.  , 10.  ,  0.8 ,  7.66],
       [ 7.  ,  2.  ,  9.2 ,  1.76]])

In [57]:
dmix.shape

(2, 4)
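Both stacking functions are special cases of `np.concatenate` along a chosen axis. A short sketch, re-creating `a2` and `b2` so it runs on its own:

```python
import numpy as np

a2 = np.array([[5, 10], [7, 2]])
b2 = np.array([[0.8, 7.66], [9.2, 1.76]])

rows = np.concatenate([a2, b2], axis=0)   # same result as np.vstack
cols = np.concatenate([a2, b2], axis=1)   # same result as np.hstack

print(rows.shape, cols.shape)
```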

##### Indexing 2-D arrays

In [58]:
dmix[0]

array([ 5.  , 10.  ,  0.8 ,  7.66])

In [59]:
dmix[:,3]

array([7.66, 1.76])

In [60]:
dmix[:,3].reshape(2,1)

array([[7.66],
       [1.76]])

In [61]:
dmix[0,1:]

array([10.  ,  0.8 ,  7.66])
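Two related indexing patterns are worth knowing, sketched here on a copy of `dmix`: a slice keeps the column as a 2-D array without needing `reshape`, and a boolean mask selects elements by condition.

```python
import numpy as np

dmix = np.array([[5.0, 10.0, 0.8, 7.66],
                 [7.0, 2.0, 9.2, 1.76]])

col_1d = dmix[:, 3]      # integer index drops the axis: shape (2,)
col_2d = dmix[:, 3:4]    # slice keeps the axis: shape (2, 1)
large = dmix[dmix > 7]   # boolean mask returns matching elements as a 1-D array

print(col_1d.shape, col_2d.shape, large)
```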

#### 1.4 Linear algebra

In [62]:
from numpy import linalg

* Docs: https://numpy.org/doc/stable/reference/routines.linalg.html
* More: https://www.geeksforgeeks.org/numpy-linear-algebra/#

##### Eigenvectors & eigenvalues

In [63]:
linalg.eig(a2)

EigResult(eigenvalues=array([12., -5.]), eigenvectors=array([[ 0.81923192, -0.70710678],
       [ 0.57346234,  0.70710678]]))

In [64]:
linalg.eigvals(a2)

array([12., -5.])

##### Determinant & rank

In [65]:
linalg.det(a2)

np.float64(-59.999999999999986)

In [66]:
linalg.matrix_rank(a2)

np.int64(2)

##### Inverses

In [67]:
linalg.inv(a2)

array([[-0.03333333,  0.16666667],
       [ 0.11666667, -0.08333333]])
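These results can be sanity-checked against each other. The sketch below verifies that the matrix times its inverse is (approximately) the identity, and that the determinant equals the product of the eigenvalues; tolerances are used because the results are floating-point.

```python
import numpy as np
from numpy import linalg

a2 = np.array([[5, 10], [7, 2]])

a2_inv = linalg.inv(a2)
is_identity = bool(np.allclose(a2 @ a2_inv, np.eye(2)))

# det(A) equals the product of A's eigenvalues (12 * -5 = -60)
det_from_eigs = np.prod(linalg.eigvals(a2))
det_matches = bool(np.isclose(det_from_eigs, linalg.det(a2)))

print(is_identity, det_matches)
```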

##### Matrix multiplication

In [68]:
e = np.array([9,3,2,7]).reshape(2,2)
e

array([[9, 3],
       [2, 7]])

In [69]:
f = np.array([5,5,2,3]).reshape(2,2)
f

array([[5, 5],
       [2, 3]])

In [70]:
e * f

array([[45, 15],
       [ 4, 21]])

In [71]:
np.dot(e, f)

array([[51, 54],
       [24, 31]])
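The two results differ because `*` multiplies matching entries while `np.dot` performs row-by-column matrix multiplication. A brief sketch, which also uses the `@` operator (equivalent to `np.dot` for 2-D arrays) and shows that the order of the operands matters:

```python
import numpy as np

e = np.array([[9, 3], [2, 7]])
f = np.array([[5, 5], [2, 3]])

elementwise = e * f          # entry-by-entry product
matrix_product = e @ f       # same as np.dot(e, f) for 2-D arrays
reversed_product = f @ e     # matrix multiplication is not commutative

print(matrix_product)
print(reversed_product)
```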

### 2. Pandas

* Pandas was built on top of NumPy
* Python's dataframe object is provided by Pandas
* Need this module to do any kind of data manipulation, cleaning, or analysis of the sort we are used to with dataframes
* Syntax similar to `data.table` in `R`

In [72]:
# ! pip3 install pandas

In [73]:
import pandas as pd
import os

#### 2.2 Getting dataframes

##### Reading CSVs

In [74]:
redwines = pd.read_csv('winequality-red.csv')

* What's happening?

In [75]:
redwines.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1,7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
2,7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...
3,11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...
4,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5


* Let's try again

In [76]:
redwines = pd.read_csv('winequality-red.csv', delimiter=';')

In [77]:
redwines.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


* Many things can go wrong; remember to check docs first when they do
    * https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
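The delimiter problem can be reproduced without the wine file. In this sketch, `raw` is a made-up, semicolon-delimited stand-in read through `io.StringIO`; with the default comma separator each row lands in a single column.

```python
import io
import pandas as pd

# Inline stand-in for a semicolon-delimited file (values are made up)
raw = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

wrong = pd.read_csv(io.StringIO(raw))            # default sep="," gives one wide column
right = pd.read_csv(io.StringIO(raw), sep=";")   # sep and delimiter are aliases

print(wrong.shape, right.shape)
```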

##### Dict to DataFrame

In [78]:
sweetness = {'type':['moscato','pinot noir', 'pinot gris',
                     'sauvignon blanc','chardonnay','merlot',
                     'riesling', 'malbec', 'zinfandel',
                     'port', 'lambrusco dolce'],
             'color': ['white','red','white',
                      'white','white','red',
                      'white', 'red', 'red',
                      'red','red'],
            'sweet_level':[0.16,0.05,0.04,
                          0.02, 0.07, 0.10,
                          0.14, 0.11, 0.15,
                          0.18, 0.16]}

In [79]:
sweet_df = pd.DataFrame(sweetness)

In [80]:
sweet_df

Unnamed: 0,type,color,sweet_level
0,moscato,white,0.16
1,pinot noir,red,0.05
2,pinot gris,white,0.04
3,sauvignon blanc,white,0.02
4,chardonnay,white,0.07
5,merlot,red,0.1
6,riesling,white,0.14
7,malbec,red,0.11
8,zinfandel,red,0.15
9,port,red,0.18
10,lambrusco dolce,red,0.16


#### 2.3 Exploring pandas dataframes

##### Dataframe attributes

In [81]:
# dir(redwines)

In [82]:
type(redwines)

pandas.core.frame.DataFrame

In [83]:
redwines.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

In [84]:
redwines.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [85]:
sweet_df.dtypes

type            object
color           object
sweet_level    float64
dtype: object

In [86]:
sweet_df.describe()

Unnamed: 0,sweet_level
count,11.0
mean,0.107273
std,0.055334
min,0.02
25%,0.06
50%,0.11
75%,0.155
max,0.18


In [87]:
redwines.duplicated()

0       False
1       False
2       False
3       False
4        True
        ...  
1594    False
1595    False
1596     True
1597    False
1598    False
Length: 1599, dtype: bool

In [88]:
redwines[redwines.duplicated()]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4,7.4,0.700,0.00,1.90,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
11,7.5,0.500,0.36,6.10,0.071,17.0,102.0,0.99780,3.35,0.80,10.5,5
27,7.9,0.430,0.21,1.60,0.106,10.0,37.0,0.99660,3.17,0.91,9.5,5
40,7.3,0.450,0.36,5.90,0.074,12.0,87.0,0.99780,3.33,0.83,10.5,5
65,7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.99620,3.41,0.39,10.9,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1563,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1564,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1567,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1581,6.2,0.560,0.09,1.70,0.053,24.0,32.0,0.99402,3.54,0.60,11.3,5
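By default `duplicated()` flags only the repeats, not the first occurrence. A small sketch on made-up data shows the `keep=False` variant and how `drop_duplicates()` removes the repeats:

```python
import pandas as pd

df = pd.DataFrame({"pH": [3.51, 3.20, 3.51, 3.16], "quality": [5, 5, 5, 6]})

later_copies = df.duplicated()            # marks repeats after the first occurrence
all_copies = df.duplicated(keep=False)    # marks every member of a duplicated group
deduplicated = df.drop_duplicates()       # keeps the first occurrence of each row

print(deduplicated)
```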


##### Assigning and modifying columns

In [89]:
redwines.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [90]:
redwines['color'] = 'red'

In [91]:
redwines = redwines.rename(columns={'pH': 'pH_level', 'quality':'rank'})
redwines.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH_level,sulphates,alcohol,rank,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


* We can use regular expressions to modify column names

In [92]:
redwines.columns = redwines.columns.str.replace(r"\s", "_", regex=True)

In [93]:
redwines.columns

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH_level', 'sulphates', 'alcohol', 'rank', 'color'],
      dtype='object')
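The pattern `\s` replaces each whitespace character separately. As a variation (an assumption about what one might want, not something the notebook does), `\s+` collapses a run of whitespace into one underscore, and `.str.lower()` can be chained to normalise case. The column names below are made up.

```python
import pandas as pd

df = pd.DataFrame(columns=["Fixed  Acidity", "pH", "free sulfur dioxide"])

# \s+ collapses a run of whitespace into a single underscore; lower() normalises case
df.columns = df.columns.str.replace(r"\s+", "_", regex=True).str.lower()

print(list(df.columns))
```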

#### 2.4 Summarizing/aggregating pandas dataframes

In [94]:
sweet_df.groupby(['color']).mean(numeric_only=True)

Unnamed: 0_level_0,sweet_level
color,Unnamed: 1_level_1
red,0.125
white,0.086


In [95]:
sweet_df.groupby(['color'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1201d82f0>

* Reset the index of summarized dfs to create new df objects

In [96]:
wine_summary = sweet_df.groupby(['color']).mean(numeric_only=True).reset_index()
wine_summary

Unnamed: 0,color,sweet_level
0,red,0.125
1,white,0.086
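Two alternative spellings of the same idea, sketched on a small made-up frame: `as_index=False` keeps the grouping column as a regular column, and named aggregation computes several summaries at once. The output column names `mean_sweet` and `n_wines` are illustrative.

```python
import pandas as pd

sweet = pd.DataFrame({
    "color": ["white", "red", "white", "red"],
    "sweet_level": [0.16, 0.05, 0.04, 0.11],
})

# as_index=False keeps "color" as a regular column, so no reset_index() is needed
means = sweet.groupby("color", as_index=False)["sweet_level"].mean()

# Named aggregation computes several summaries per group in one call
summary = sweet.groupby("color").agg(
    mean_sweet=("sweet_level", "mean"),
    n_wines=("sweet_level", "size"),
).reset_index()

print(means)
print(summary)
```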


In [97]:
redwines['pH_level'].value_counts()

pH_level
3.30    57
3.36    56
3.26    53
3.38    48
3.39    48
        ..
3.75     1
2.74     1
3.70     1
3.85     1
2.90     1
Name: count, Length: 89, dtype: int64

In [98]:
redwines['rank'].value_counts()

rank
5    681
6    638
7    199
4     53
8     18
3     10
Name: count, dtype: int64

In [99]:
sweet_df.sort_values(by = "sweet_level", ascending = False)

Unnamed: 0,type,color,sweet_level
9,port,red,0.18
0,moscato,white,0.16
10,lambrusco dolce,red,0.16
8,zinfandel,red,0.15
6,riesling,white,0.14
7,malbec,red,0.11
5,merlot,red,0.1
4,chardonnay,white,0.07
1,pinot noir,red,0.05
2,pinot gris,white,0.04
3,sauvignon blanc,white,0.02


##### Subsetting

In [100]:
# All rows from column "chlorides" & "density"
redwines.loc[:,["chlorides", "density"]] 

Unnamed: 0,chlorides,density
0,0.076,0.99780
1,0.098,0.99680
2,0.092,0.99700
3,0.075,0.99800
4,0.076,0.99780
...,...,...
1594,0.090,0.99490
1595,0.062,0.99512
1596,0.076,0.99574
1597,0.075,0.99547


* Equivalent

In [101]:
redwines[["chlorides", "density"]] 

Unnamed: 0,chlorides,density
0,0.076,0.99780
1,0.098,0.99680
2,0.092,0.99700
3,0.075,0.99800
4,0.076,0.99780
...,...,...
1594,0.090,0.99490
1595,0.062,0.99512
1596,0.076,0.99574
1597,0.075,0.99547


In [102]:
redwines.loc[0:2, ["chlorides"]] # First three rows of column "chlorides" (.loc slices include the end label)

Unnamed: 0,chlorides
0,0.076
1,0.098
2,0.092


In [103]:
# First 2 rows and 3 columns
redwines.iloc[0:2, 0:3]

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid
0,7.4,0.7,0.0
1,7.8,0.88,0.0


Difference between `.iloc` and `.loc`:
- `.loc`: gets rows / columns using labels
- `.iloc`: gets rows / columns using positions (integer location indexing)
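The distinction is easiest to see when the index labels are not 0, 1, 2, and so on. In this sketch the index values 10 to 40 are made up for illustration. Note that `.loc` slices include the end label, while `.iloc` slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"chlorides": [0.076, 0.098, 0.092, 0.075]},
                  index=[10, 20, 30, 40])

by_label = df.loc[10:30, "chlorides"]     # labels 10 through 30, end label included
by_position = df.iloc[0:2, 0]             # positions 0 and 1, end position excluded

print(by_label.tolist())
print(by_position.tolist())
```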

In [106]:
redwines[(redwines['rank'] == 8) & (redwines['residual_sugar'] > 2)]

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH_level,sulphates,alcohol,rank,color
267,7.9,0.35,0.46,3.6,0.078,15.0,37.0,0.9973,3.35,0.86,12.8,8,red
278,10.3,0.32,0.45,6.4,0.073,5.0,13.0,0.9976,3.23,0.82,12.6,8,red
440,12.6,0.31,0.72,2.2,0.072,6.0,29.0,0.9987,2.88,0.82,9.8,8,red
455,11.3,0.62,0.67,5.2,0.086,6.0,19.0,0.9988,3.22,0.69,13.4,8,red
481,9.4,0.3,0.56,2.8,0.08,6.0,17.0,0.9964,3.15,0.92,11.7,8,red
495,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,red
498,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,red
828,7.8,0.57,0.09,2.3,0.065,34.0,45.0,0.99417,3.46,0.74,12.7,8,red
1120,7.9,0.54,0.34,2.5,0.076,8.0,17.0,0.99235,3.2,0.72,13.1,8,red


In [107]:
redwines[redwines['rank'] == 8 & redwines['residual_sugar'] > 2]

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
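The error comes from operator precedence: in Python `&` binds more tightly than `==` and `>`, so without parentheses the expression tries to evaluate `8 & redwines['residual_sugar']` first. A minimal sketch with made-up data shows the working form, and plain integers show the precedence itself.

```python
import pandas as pd

wines = pd.DataFrame({"rank": [8, 8, 5], "residual_sugar": [3.6, 1.8, 2.5]})

# Parenthesise each comparison so it is evaluated before &
mask = (wines["rank"] == 8) & (wines["residual_sugar"] > 2)
selected = wines[mask]

# Plain Python integers show the precedence: & is evaluated before ==
without_parens = 2 == 2 & 1    # parsed as 2 == (2 & 1), i.e. 2 == 0
with_parens = (2 == 2) & 1     # True & 1

print(selected)
print(without_parens, with_parens)
```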

In [109]:
redwines[(redwines['rank'] == 8) & (redwines['residual_sugar'] > 2)].sort_values(['residual_sugar'], 
                                                                                 ascending=False)

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH_level,sulphates,alcohol,rank,color
278,10.3,0.32,0.45,6.4,0.073,5.0,13.0,0.9976,3.23,0.82,12.6,8,red
455,11.3,0.62,0.67,5.2,0.086,6.0,19.0,0.9988,3.22,0.69,13.4,8,red
267,7.9,0.35,0.46,3.6,0.078,15.0,37.0,0.9973,3.35,0.86,12.8,8,red
481,9.4,0.3,0.56,2.8,0.08,6.0,17.0,0.9964,3.15,0.92,11.7,8,red
495,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,red
498,10.7,0.35,0.53,2.6,0.07,5.0,16.0,0.9972,3.15,0.65,11.0,8,red
1120,7.9,0.54,0.34,2.5,0.076,8.0,17.0,0.99235,3.2,0.72,13.1,8,red
828,7.8,0.57,0.09,2.3,0.065,34.0,45.0,0.99417,3.46,0.74,12.7,8,red
440,12.6,0.31,0.72,2.2,0.072,6.0,29.0,0.9987,2.88,0.82,9.8,8,red


#### 2.5 Light data cleaning with pandas

In [110]:
whitewines = pd.read_csv('winequality-white.csv', delimiter=';')
whitewines['color'] = 'white'
whitewines = whitewines.rename(columns={'pH': 'pH_level', 'quality':'rank'})
whitewines.columns = whitewines.columns.str.replace(r"\s", "_", regex=True)
whitewines.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH_level,sulphates,alcohol,rank,color
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


In [111]:
allwines = pd.concat([redwines,whitewines])

In [112]:
allwines.shape

(6497, 13)
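One thing to be aware of: `pd.concat` keeps each frame's original index, so labels can repeat in the combined frame. The sketch below, on two tiny made-up frames, shows this and the `ignore_index=True` option, which renumbers the rows.

```python
import pandas as pd

red = pd.DataFrame({"alcohol": [9.4, 9.8], "color": "red"})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "color": "white"})

stacked = pd.concat([red, white])                         # original labels are kept
renumbered = pd.concat([red, white], ignore_index=True)   # labels become 0..4

print(stacked.index.tolist())
print(renumbered.index.tolist())
```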

In [113]:
pd.get_dummies(allwines['color'])

Unnamed: 0,red,white
0,True,False
1,True,False
2,True,False
3,True,False
4,True,False
...,...,...
4893,False,True
4894,False,True
4895,False,True
4896,False,True


In [114]:
allwines = pd.concat([allwines, pd.get_dummies(allwines['color'], drop_first=True, dtype=float)], axis=1)

In [115]:
allwines.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH_level,sulphates,alcohol,rank,color,white
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,0.0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,0.0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,0.0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,0.0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,0.0
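With two categories, one indicator column carries all the information, which is what `drop_first=True` exploits. A short sketch on a made-up series:

```python
import pandas as pd

colors = pd.Series(["red", "white", "red"], name="color")

full = pd.get_dummies(colors, dtype=float)                      # columns: red, white
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)  # column: white only

print(full)
print(reduced)
```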


In [117]:
allwines['avg_acidity'] = allwines[['fixed_acidity', 'volatile_acidity', 'citric_acid']].mean(axis=1)

In [118]:
allwines['avg_acidity'].describe()

count    6497.000000
mean        2.624535
std         0.462675
min         1.376667
25%         2.340000
50%         2.533333
75%         2.793333
max         5.681667
Name: avg_acidity, dtype: float64

* More useful data cleaning functions: https://pandas.pydata.org/docs/reference/general_functions.html

##### Outputs

* CSV

In [119]:
allwines[['avg_acidity', 'white','rank']].to_csv('cleaned_wine_data.csv')
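By default `to_csv` also writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. The sketch below uses a made-up frame and returns the CSV as text (no path is given) to show the difference `index=False` makes.

```python
import io
import pandas as pd

df = pd.DataFrame({"avg_acidity": [2.7, 3.1], "rank": [5, 6]})

with_index = df.to_csv()                 # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)   # leave the row index out

reloaded = pd.read_csv(io.StringIO(with_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(list(reloaded.columns))
```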

* NumPy Array

In [120]:
winearray = allwines[['avg_acidity', 'white','rank']].to_numpy()

In [121]:
winearray[3500:3510,:]

array([[2.63666667, 1.        , 5.        ],
       [3.12666667, 1.        , 6.        ],
       [2.80333333, 1.        , 5.        ],
       [2.83      , 1.        , 6.        ],
       [2.82      , 1.        , 6.        ],
       [2.21      , 1.        , 5.        ],
       [2.85666667, 1.        , 7.        ],
       [2.64      , 1.        , 7.        ],
       [2.69      , 1.        , 5.        ],
       [2.40333333, 1.        , 5.        ]])

### 3. Random Forest

#### Installations & imports

In [122]:
# ! pip3 install scikit-learn

In [123]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [124]:
df1 = allwines[['avg_acidity', 'white','rank']]

##### Check for missing values

In [125]:
print(df1.isnull().sum())

avg_acidity    0
white          0
rank           0
dtype: int64


##### Define X and y 

In [126]:
X = df1.drop(["rank"], axis=1)
y = df1["rank"]

##### Split train and test sets

In [127]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

##### Train classifier

In [128]:
rfc = RandomForestClassifier(random_state=42)

##### Basic diagnostics

In [129]:
accuracies = cross_val_score(rfc, X_train, y_train, cv=5)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
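It helps to keep three numbers apart here: cross-validated accuracy (each fold scored on training rows held out from that fold's fit), accuracy on the training rows themselves, and accuracy on the test set. The sketch below uses synthetic data from `make_classification` rather than the wine data, and a smaller forest (`n_estimators=50`) purely to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the wine data)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rfc = RandomForestClassifier(n_estimators=50, random_state=42)

# cross_val_score fits clones of rfc on 4/5 of the training data and scores
# each clone on the held-out fifth; rfc itself is left unfitted
cv_scores = cross_val_score(rfc, X_train, y_train, cv=5)

rfc.fit(X_train, y_train)
train_accuracy = rfc.score(X_train, y_train)   # accuracy on rows the model has seen
test_accuracy = rfc.score(X_test, y_test)      # accuracy on held-out rows

print("Mean CV accuracy:", cv_scores.mean())
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```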



In [130]:
import numpy as np

In [131]:
print("Mean CV Score:", np.mean(accuracies))
print("Test Score:", rfc.score(X_test, y_test))

Mean CV Score: 0.4210152143333087
Test Score: 0.43846153846153846
