# scikit-learn

## Overview

Scikit-learn [1] is a Python package containing a wide selection of tools for machine learning. Machine learning is a technique of using computers to make predtictions on unseen data based on either previously seen data or algorithms which find patterns within data. In addition to a large collection of data analysis algorithms, the package also provides tools for data preprocessing and dimensionality reduction; model selection and hyperparameter tuning - that is, finding an algorithm and a set of parameters to that algorithm that enable it to best perform on a particular dataset.


In [1]:
# imports
import pandas as pd
import os


In [16]:
# Data urls
red_wine_quality = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csvv'
white_wine_quality = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csvv'

# Local paths
rw_csv = 'data/red_wine_quality.csv'
ww_csv = 'data/white_wine_quality.csv'

def getData(data_url, local_data_path, delimiter=','):
    """Function to download well-formed csv datasets from the web and
       load into a pandas DataFrame. The function attempts to retrieve 
       the data from `data_url` and, if it fails, will attempt to load
       the data from `local_data_path`. If it succeeds in downloading 
       the data from the web, it will check to see if a file exists at 
       `local_data_path`, and if not will write the data as a csv file 
       using pandas.to_csv. Pandas needs to be imported as pd to use 
       this function.

    Args:
        data_url (str): URL of CSV file

        local_data_path (str): Path, including filename where data may 
                               be written to locally. May be relative 
                               or absolute.

    Returns:
        pandas.DataFrame: A pandas DataFrame holding the contents of the
                          CSV file (parsed using pandas.to_csv())
    """
    try:
        df = pd.read_csv(data_url, delimiter=delimiter)
        if not os.path.isfile(local_data_path):
            print("Saving CSV file to data directory")
            df.to_csv(local_data_path, index=False)
        else:
            print("CSV file already exists in data directory")
    except IOError:
        print("Reading from local CSV file")
        df = pd.read_csv(local_data_path)
    
    return df


In [17]:
rw = getData(red_wine_quality, rw_csv, delimiter=';')
rw.head()

TypeError: getWineData() got an unexpected keyword argument 'delimiter'

In [15]:
ww = getData(white_wine_quality, ww_csv, delimiter=';')
ww.head()

Reading from local CSV file


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## Algorithm demonstrations

### 1 Support Vector Machines

### 2 Decision Trees

### 3 Random Forests

## References