# Getting Started with scikit-learn

This notebook shows how to install scikit-learn and covers some of the basics.

### Installing libraries

In [27]:
!pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /Users/loonycorn/anaconda3/lib/python3.7/site-packages (0.21.3)


In [2]:
import sklearn

import numpy as np
import pandas as pd

In [3]:
print(sklearn.__version__)

0.21.3


In [4]:
print("NumPy", np.__version__)
print("Pandas", pd.__version__)

NumPy 1.15.4
Pandas 0.24.0


Next, we use scikit-learn's built-in datasets to look at a few different kinds of data and the functions available for inspecting them.

In [5]:
from sklearn.datasets import load_diabetes

In [6]:
diabetes_dataset = load_diabetes()

In [7]:
diabetes_dataset.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
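The object returned by `load_diabetes()` is a dictionary-like `Bunch`, so each key listed above can be read either as a dictionary key or as an attribute. A minimal sketch of both access styles (the variable names `data_by_key`, `data_by_attr`, and `same_object` are chosen here for illustration):

```python
from sklearn.datasets import load_diabetes

diabetes_dataset = load_diabetes()

# Dictionary-style and attribute-style access refer to the same array.
data_by_key = diabetes_dataset["data"]
data_by_attr = diabetes_dataset.data
same_object = data_by_key is data_by_attr

print(type(diabetes_dataset).__name__)
print(same_object)
```

The rest of this notebook uses attribute access, which is a little shorter to type.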

In [8]:
print(diabetes_dataset.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

##### The features 's1' through 's6' are the six blood serum measurements

In [9]:
diabetes_dataset.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [10]:
diabetes_dataset.data.shape

(442, 10)

In [11]:
diabetes_dataset.target.shape

(442,)
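The dataset description notes that each feature column has been mean centered and scaled so that its sum of squares is 1. As a rough check of that claim, assuming the default loader settings, we can compute the mean and the sum of squares of every column (the names `X`, `column_means`, and `column_sum_sq` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes_dataset = load_diabetes()
X = diabetes_dataset.data

# Mean of each of the 10 feature columns (expected to be close to 0).
column_means = X.mean(axis=0)

# Sum of squares of each column (expected to be close to 1).
column_sum_sq = (X ** 2).sum(axis=0)

print(np.round(column_means, 6))
print(np.round(column_sum_sq, 6))
```

This also explains why the feature values shown later are small numbers around zero rather than raw ages or blood pressure readings.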

We can combine the features and the target into a single pandas DataFrame:

In [12]:
df_features = pd.DataFrame(diabetes_dataset.data, columns=diabetes_dataset.feature_names)

df_target = pd.DataFrame(diabetes_dataset.target, columns=["disease progression"])

In [13]:
df = pd.concat([df_features, df_target], axis=1)

In [14]:
df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | disease progression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


In [15]:
df.shape

(442, 11)
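If the metadata is not needed, `load_diabetes` also accepts `return_X_y=True`, which returns the feature array and the target array directly. The sketch below builds an equivalent 442 x 11 table that way; the names `X`, `y`, and `df_alt` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Column names still come from the Bunch object.
feature_names = load_diabetes().feature_names

# return_X_y=True yields a (data, target) tuple instead of a Bunch.
X, y = load_diabetes(return_X_y=True)

df_alt = pd.DataFrame(X, columns=feature_names)
df_alt["disease progression"] = y

print(df_alt.shape)
```

Assigning the target as a new column avoids building a second DataFrame and concatenating, at the cost of modifying `df_alt` in place.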

The iris dataset is another dataset bundled with scikit-learn.

In [16]:
from sklearn.datasets import load_iris

In [17]:
iris_data = load_iris()

In [18]:
iris_data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [19]:
print(iris_data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [20]:
iris_data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [21]:
iris_data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [22]:
iris_data.data.shape

(150, 4)

In [23]:
iris_data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
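The dataset description printed earlier includes a summary statistics table (min, max, mean, SD). As a sketch of how similar figures could be reproduced with pandas, assuming the feature array is wrapped in a DataFrame (`df_iris_features` and `summary` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()

df_iris_features = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# One row per feature, one column per statistic.
summary = df_iris_features.agg(["min", "max", "mean", "std"]).T

print(summary.round(2))
```

Small differences from the printed description are possible, since the description rounds its values.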

In [24]:
iris_data.target.shape

(150,)
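As with the diabetes data, the iris features and target can be combined into one DataFrame. Because the target here is a class index (0, 1, or 2), it can be mapped to the species names in `target_names`. A minimal sketch, where the `df_iris` variable and the `species` column name are assumptions made for this example:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()

df_iris = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Indexing the array of names with the integer targets gives one label per row.
df_iris["species"] = iris_data.target_names[iris_data.target]

print(df_iris.shape)
print(df_iris["species"].value_counts())
```

Each of the three species should appear 50 times, matching the class distribution given in the dataset description.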