Before we start: https://www.when2meet.com/?10156131-vWB6B


Here's an overview of our plan for the weekly curriculum:

* Week 3: Introduction
* Week 4: Data retrieval and preparation
* Week 5: Data exploration and visualization
* Week 6: Modeling and machine learning
* Week 7: Supervised models
* Week 8: Unsupervised models 
* Week 9: Neural networks 

Without further ado, let's begin!

# What is data science?

* "Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics." - *Introducing Data Science*

* "Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data." - *Wikipedia*

To put it simply, *data science* is the *science* of *data*.

*Data* is any information that we can collect through observation, and it comes in many different types. For example, data can be numbers or language, structured (e.g. in spreadsheets, tables, or data frames) or unstructured, or other formats (graphs/networks, videos/images).

*Science* is the process of building and organizing our knowledge of the world in the form of testable explanations and predictions.

So basically, the goal of data science is to increase our understanding of something (anything!) by analyzing the data that comes from it.

![](https://www.nairaland.com/attachments/8401767_91e46ae702f14a88850af9fc4a9a85db_pngb329a184b869ee8ed3c775c0657656c1)

# The Data Science Process

<img src="https://miro.medium.com/max/1200/1*eE8DP4biqtaIK3aIy1S2zA.png" width="800">

### Obtaining Data

The first fundamental step of data science is asking a question and obtaining relevant data.

e.g. *Who is the most valuable player in the NBA from the years 1997-2007?*

Some data sets can be easily found online (Kaggle, data.gov, data.world, etc.). Sometimes you will be provided with a data set, but if you are working on a personal project, you will often have to find relevant dataset(s) or scrape data off websites (we will talk about this more in depth next week).

### Data Cleansing

Data collected by companies or the government is often formatted in such a way that the raw, unchanged data is incompatible with machine learning tools. To clean the data so that our tools can understand what we input, we must manipulate the dataset to match the format those tools expect.

This includes changing the format of certain cells within observations (e.g. removing the dollar sign from a price column), removing outliers that would adversely affect our machine learning algorithms, and either removing observations with NA values or imputing missing values with the mean of the predictor.
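
As a rough illustration of these cleaning steps, here is a minimal pandas sketch. The tiny `raw` table and its column names (`price`, `rating`) are made up for this example, and mean imputation is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Hypothetical raw data: prices stored as strings, plus some missing values
raw = pd.DataFrame({
    "price": ["$10.50", "$3.00", "$7.25", None],
    "rating": [4.0, None, 5.0, 3.0],
})

# Drop observations where the price is missing entirely
clean = raw.dropna(subset=["price"]).copy()

# Remove the dollar sign and convert the column to floats
clean["price"] = clean["price"].str.replace("$", "", regex=False).astype(float)

# Impute the remaining missing rating with the mean of the observed ratings
clean["rating"] = clean["rating"].fillna(clean["rating"].mean())

print(clean)
```

Whether to drop or impute depends on the data set; dropping is simpler, while imputing keeps more observations.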

### Exploratory Data Analysis (EDA)

Before we apply machine learning algorithms to our data, we have to understand the variables within it. EDA allows us to find significant trends using statistical methods and to create visualizations of trends that would not be obvious from just looking at the data. We often plot some variables to understand the structure of the data (spread, center, etc.), which may give insight into which machine learning algorithms we might use.


<img src="https://eglouberman.github.io/MLB-hit-predictor/docs/images/corr2.png" width="400"> 
<img src="https://raw.githubusercontent.com/the-data-science-union/DSU-Team-Music/master/Visualizations/Number%20of%20Followers.png" width="800">

Besides visualizing data, we might also want to find relevant variables through general linear models or correlation matrices (and also remove multicollinearity).
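
To give a feel for what a correlation matrix looks like, here is a small sketch using pandas' `DataFrame.corr()`. The columns `x`, `y`, and `z` are invented for illustration: `y` is an exact linear function of `x`, and `z` decreases as `x` increases.

```python
import pandas as pd

# Made-up measurements: y is exactly 2*x + 1, z goes down as x goes up
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [3, 5, 7, 9, 11],
    "z": [10, 8, 7, 3, 1],
})

# Pairwise Pearson correlations between all numeric columns
corr = data.corr()
print(corr.round(2))
```

A pair of predictors with a correlation very close to 1 or -1 (like `x` and `y` here) is a typical sign of multicollinearity.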


### Model Fitting

Once we've explored the data set and determined some variables that may be significant for our predictions (when applicable), we can move on to fitting our data to a model. Machine Learning allows us to utilize real-world data to make predictions and forecasts.
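
To make "fitting a model" a little more concrete, here is a minimal sketch that fits a straight line by least squares with NumPy's `np.polyfit`. The data points are invented for this example, and a straight line is just one very simple choice of model.

```python
import numpy as np

# Invented data that roughly follows y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a degree-1 polynomial (a line); coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict y at a new, unseen x value
x_new = 5.0
y_pred = slope * x_new + intercept
print(slope, intercept, y_pred)
```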

Fitting models isn't always as easy as pulling a pre-written function out of a package and applying it to certain variables - there are often other things to consider, like bias-variance tradeoff, flexibility vs interpretability, and overfitting models (all of which we will go over at a later date).


<img src="https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet.png" width="800">

There are a lot of different types of machine learning models, but the most popular ones fall under two categories: supervised machine learning and unsupervised machine learning.

<img src="https://lawtomated.com/wp-content/uploads/2019/04/supVsUnsup.png" width="800">

# Getting started with Python

We encourage everyone to download and install Anaconda (https://www.anaconda.com/products/individual), which comes with Python v3.8 and Jupyter Notebook. We will be doing curriculum and projects mainly through Jupyter Notebook, and these are saved as .ipynb (IPython Notebook) files.

In the meantime, you can also get started with Google Colab (https://colab.research.google.com), which will allow you to write and run Jupyter Notebooks directly in Google Drive without installing anything.




## Basic Python

### Printing in Jupyter Notebook

As you would expect, you can print things in Python using Python's built-in `print()` function:

In [None]:
print("Hello, World!")

x = 3
print(x)

Hello, World!
3


However, in Jupyter, you can also display objects by simply typing an expression, rather than calling the `print()` function. For example:

In [None]:
x

3

In [None]:
2 * x**2 + 1

19

### Functions and control flow

I'm assuming all of you are familiar with functions, loops, etc. so I'm just going to give some examples of Python's syntax.

Usually, you define a function using `def`:

In [None]:
def add_one(x):
    print('Adding one:')
    return x+1

add_one(2)

Adding one:


3

But you can also define functions using `lambda` expressions, if the function's return value can be expressed in one line:

In [None]:
prod = lambda x, y: x * y

prod(3, 4)

12
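
Lambda expressions are most often passed straight to another function rather than given a name. Here's a small sketch, with a made-up `people` list, that uses a lambda as the `key` argument of the built-in `sorted()` function:

```python
# Sort (name, age) pairs by age, using a lambda as the sort key
people = [("Ann", 31), ("Bob", 25), ("Cat", 28)]
by_age = sorted(people, key=lambda person: person[1])

print(by_age)  # [('Bob', 25), ('Cat', 28), ('Ann', 31)]
```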

And here is an example of a `for` loop with some `if` statements:

In [None]:
for i in range(17):
    if i % 3 == 0 and i % 5 == 0:
        print('FizzBuzz')
    elif i % 3 == 0:
        print('Fizz')
    elif i % 5 == 0:
        print('Buzz')
    else:
        print(i)

FizzBuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16


Note that the `range` function caused the `for` loop to run starting at `i=0` up to and *excluding* 17.
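
`range` also accepts an optional start value and step size. A quick sketch of the three forms, wrapped in `list()` so the values are visible:

```python
# range(stop): 0 up to, but not including, stop
zero_to_four = list(range(5))
print(zero_to_four)   # [0, 1, 2, 3, 4]

# range(start, stop): start up to, but not including, stop
two_to_five = list(range(2, 6))
print(two_to_five)    # [2, 3, 4, 5]

# range(start, stop, step): count upward in increments of step
by_threes = list(range(1, 10, 3))
print(by_threes)      # [1, 4, 7]
```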

Also, in Python, **indentation matters**. The contents of a function definition or a loop have to be indented at the same level, or else Python will throw a syntax error:

In [None]:
for i in range(17):
    if i % 3 == 0 and i % 5 == 0:
        print('FizzBuzz')
        elif i % 3 == 0:
            print('Fizz')
    elif i % 5 == 0:
        print('Buzz')
    else:
        print(i)

SyntaxError: ignored

### Built-in data structures

Python has four built-in data structures: lists, tuples, dictionaries, and sets.

For our purposes, we'll really only be using lists and dictionaries, but I'll briefly introduce all of them.

#### Lists

Lists contain a certain number of items in a specific order. The list items can be of any type (even other lists!), and different items can be of different types. You create lists with square brackets: `[]`.

In [None]:
my_list = [1, 'two', 3.0, [4, 5], 'six']
my_list

[1, 'two', 3.0, [4, 5], 'six']

You can get the length of a list using `len()`:

In [None]:
len(my_list)

5

And you can access and change the items of a list by enclosing its index in square brackets and using zero-based indexing:

In [None]:
my_list[0]

1

In [None]:
my_list[0] = "#1"
my_list

['#1', 'two', 3.0, [4, 5], 'six']

Python also has negative indexing, which counts backwards from the end of the list. So `my_list[-1]` returns the last element of the list.

In [None]:
my_list[-1]

'six'

You can also **slice** a list using slice notation: `list[start:stop(:step)]`. For example, `my_list[1:5:2]` returns a list with only the elements at indices 1 and 3:

In [None]:
my_list[1:5:2]

['two', [4, 5]]

The last `:step` part of slice notation is optional, and you can also omit `start`, `stop`, or `step`. For a positive `step`, they default to 0, `len(list)`, and 1 respectively; when `step` is negative, the slice instead runs by default from the end of the list back to the beginning. So, to return a list in reverse order, you can do the following:

In [None]:
my_list[::-1]

['six', [4, 5], 3.0, 'two', '#1']
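
Here are a few more slices with omitted parts, using a small example list (`letters` is made up for this illustration):

```python
letters = ['a', 'b', 'c', 'd', 'e']

first_three = letters[:3]    # start omitted: begins at index 0
from_two = letters[2:]       # stop omitted: runs to the end of the list
every_other = letters[::2]   # only step given: every second element
last_two = letters[-2:]      # negative indices work in slices too

print(first_three)  # ['a', 'b', 'c']
print(from_two)     # ['c', 'd', 'e']
print(every_other)  # ['a', 'c', 'e']
print(last_two)     # ['d', 'e']
```

Slicing always returns a new list, so `letters` itself is left unchanged.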

Lists come with several built-in methods. The most commonly used method is probably `list.append()`, which appends the input value to the end of the list:

In [None]:
my_list.append(7.0)
my_list

['#1', 'two', 3.0, [4, 5], 'six', 7.0]

You can see all the `list` methods here: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
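
As a quick taste of a few other list methods from that page, here's a short sketch on a throwaway list called `nums`:

```python
nums = [3, 1, 2]

nums.extend([5, 4])   # add every item of another list: [3, 1, 2, 5, 4]
nums.insert(0, 0)     # insert the value 0 at index 0: [0, 3, 1, 2, 5, 4]
last = nums.pop()     # remove and return the last item (4)
nums.sort()           # sort the list in place

print(nums)  # [0, 1, 2, 3, 5]
print(last)  # 4
```

Note that these methods modify the list in place instead of returning a new list (`pop` is the only one here that returns something useful).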

Python also has a powerful way to create lists called **list comprehension**. Here's an example:

In [None]:
[x**2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
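
A list comprehension can also include an `if` condition to filter items. The sketch below builds the same list twice, once with a comprehension and once with an ordinary loop, to show that the two are equivalent:

```python
# Squares of the even numbers below 10, as a comprehension
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# The same result, written as an ordinary for loop
even_squares_loop = []
for x in range(10):
    if x % 2 == 0:
        even_squares_loop.append(x**2)

print(even_squares)                       # [0, 4, 16, 36, 64]
print(even_squares == even_squares_loop)  # True
```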

#### Dictionaries

Dictionaries map keys to values; each `value` is indexed by its respective `key`. You construct dictionaries by enclosing a list of `key: value` pairs within curly brackets `{}`. Here's an example:

In [None]:
my_dict = {"Name": "Mookie", "Age": 28, "Weight": 180, "BA": 0.292}
my_dict["Age"]

28

The keys of a dictionary don't all have to be the same type, but they must be *immutable* (strings, numbers, and tuples containing them are examples). So, I can add new keys to the dictionary such as:

In [None]:
my_dict[365] = "million"
my_dict

{365: 'million', 'Age': 28, 'BA': 0.292, 'Name': 'Mookie', 'Weight': 180}

You can also use dict comprehension to construct dictionaries:

In [None]:
{x: x**2 for x in range(10)}

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

#### Tuples

Tuples are like lists, except that they're immutable: you can't manually change the values of their entries. You construct tuples using `()`.

In [None]:
my_tuple = (1, 'two', 3.0, [4, 5], 'six')

In [None]:
my_tuple[0] = 'one'

TypeError: ignored

Sometimes, you want to print multiple things in one Jupyter notebook cell. In that case, you can just separate the expressions with a comma and it'll return them as a tuple:

In [None]:
x = 2
y = 3

x, y

(2, 3)

There's also a trick called *tuple assignment*, which you can use to assign values to multiple variables at once:

In [None]:
a, b, c = 1, 2, 3
print('a =', a)
print('b =', b)
print('c =', c)

a = 1
b = 2
c = 3
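
Two common uses of tuple assignment are swapping variables and unpacking a function that returns several values. In this sketch, `min_max` is a made-up helper function written just for the example:

```python
# Swap two variables without a temporary variable
first, second = 1, 2
first, second = second, first
print(first, second)  # 2 1


def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)


# Unpack the returned tuple into two variables
lowest, highest = min_max([4, 1, 9])
print(lowest, highest)  # 1 9
```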


#### Sets

We won't be using sets very often, but a set is an unordered collection with no duplicate elements. To create a set, use either the `set()` constructor on a list/tuple or curly brackets `{}`.

In [None]:
my_set = set([1, 2, 2, 3, "four"])
# Equivalent, using curly brackets: my_set = {1, 2, 2, 3, "four"}
my_set

{1, 2, 3, 'four'}

Note that `my_set` only contains one 2, since its elements must be unique.
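
Sets also support the usual mathematical set operations. A short sketch with two made-up sets (the results are wrapped in `sorted()` only so they print in a predictable order):

```python
evens = {2, 4, 6, 8}
small = {1, 2, 3, 4}

union = evens | small         # elements in either set
intersection = evens & small  # elements in both sets
difference = evens - small    # elements in evens but not in small

print(sorted(union))         # [1, 2, 3, 4, 6, 8]
print(sorted(intersection))  # [2, 4]
print(sorted(difference))    # [6, 8]
print(3 in small)            # True (membership test)
```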

### Iterables

In previous examples of `for` loops, I used the `range` function, which returns an object of the special class `range`. 

In [None]:
type(range(5))

range

However, you can also loop through all of the data structures we just discussed (including lists and dictionaries), since all of them are *iterable* types. Here's the syntax:

In [None]:
for n in [2, 3, 5, 7]:
    print(n)

2
3
5
7


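For dictionaries, a plain `for` loop iterates over the keys (the `.values()` and `.items()` methods let you loop over the values or the key-value pairs instead):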

In [None]:
for key in {1: 'a', 2: 'b', 3: 'c'}:
    print(key)

1
2
3
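
Here's a small sketch of looping over a dictionary's values and key-value pairs with the built-in `dict.values()` and `dict.items()` methods (the dictionary name is made up for this example):

```python
letters_by_number = {1: 'a', 2: 'b', 3: 'c'}

# Loop over the values only
for value in letters_by_number.values():
    print(value)

# Loop over (key, value) pairs, unpacking each pair as we go
for key, value in letters_by_number.items():
    print(key, '->', value)

# items() can also be turned into a list of tuples
pairs = list(letters_by_number.items())
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```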


### Classes

Python is an *object-oriented* programming language, meaning that it is designed so that most of the code we write manipulates objects.

**Everything in Python is an object** with a specific data `type`. *Classes* in Python provide the mechanism for creating *objects* or *instances* of each type, and defining each type's *attributes* and *methods*. Attributes are variables that belong to an object/class instance, and methods are functions that belong to an object/class instance.

Here's an example of how to use them, taken from the Python tutorial:

In [None]:
class MyClass:
    """A simple example class"""
    i = 12345

    def f(self):
        return 'hello world'

In [None]:
my_object = MyClass()

print(my_object.i) # print class attribute
my_object.f()      # run class method

12345


'hello world'

I won't go into more detail on how to write your own classes, but you can read more about it here: https://docs.python.org/3/tutorial/classes.html#a-first-look-at-classes.

All of the data science libraries we will be using introduce new classes, and we need to understand how to use them effectively for our data science needs.
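
The same attribute and method syntax works on Python's built-in objects, which is a quick way to see that everything really is an object. A short sketch using a complex number and a string:

```python
z = 3 + 4j
print(type(z))        # <class 'complex'>
print(z.real)         # 3.0 (an attribute: no parentheses)
print(z.conjugate())  # (3-4j) (a method: called with parentheses)

word = "data"
print(type(word))     # <class 'str'>
print(word.upper())   # DATA
```

Library classes such as the pandas `DataFrame` are used in exactly the same way: attributes are read with a dot, and methods are called with a dot followed by parentheses.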

## Introducing pandas

Here we'll be introducing pandas, one of the most popular Python packages for data analysis and manipulation.

Pandas is built on top of NumPy, a numerical computing library, so we'll import both:

In [None]:
import numpy as np
import pandas as pd

### The DataFrame class

The main feature of Pandas is a new data class called `DataFrame`. One way to initialize a `DataFrame` is to use the function `pd.DataFrame()`. Here's an example:

In [None]:
d = {'A': [1, 2, 3],
     'B': ['four', 'five', 'six']}

df = pd.DataFrame(d)
df

|   | A | B    |
|---|---|------|
| 0 | 1 | four |
| 1 | 2 | five |
| 2 | 3 | six  |


A dictionary isn't the only possible input to `pd.DataFrame`: you can refer to the official Pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for more details. 

And keep this in mind: learning how to read documentation is key to being an effective data science programmer!

The columns of a `DataFrame` are stored as `pd.Series` objects, and you can access them either as attributes or with index notation. For example, to access the `B` column of `df` by name, we can use either `df.B` or `df['B']`:

In [None]:
print(df.B)
print('')
print(df['B'])

0    four
1    five
2     six
Name: B, dtype: object

0    four
1    five
2     six
Name: B, dtype: object


In [None]:
type(df['B'])

pandas.core.series.Series

We recommend using `df['B']`, in case some column names contain spaces.
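
To see why, here's a small sketch with a made-up `stats` table whose first column name contains a space. Index notation works, while attribute access has no valid spelling for that column:

```python
import pandas as pd

# Hypothetical table: one column name contains a space
stats = pd.DataFrame({'Batting Average': [0.292, 0.310],
                      'Hits': [73, 80]})

# Index notation works for any column name
averages = stats['Batting Average']
print(averages)

# Attribute access only works for names that are valid Python identifiers;
# writing stats.Batting Average would be a syntax error
total_hits = stats.Hits.sum()
print(total_hits)  # 153
```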

To view all the attributes/methods of a `DataFrame` in Jupyter Notebook, you can type the `DataFrame`'s name, a period, and hit TAB:

In [None]:
# df.<TAB>

### Reading and writing files

You can also create a `DataFrame` using data from an external source and one of pandas' built-in input/output functions, such as `pd.read_csv` and `pd.read_excel`. Here's an example, using a dataset on forest fires from the UCI Machine Learning repository:

In [None]:
ff = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
ff

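|     | X   | Y   | month | day | FFMC | DMC   | DC    | ISI  | temp | RH  | wind | rain | area  |
|-----|-----|-----|-------|-----|------|-------|-------|------|------|-----|------|------|-------|
| 0   | 7   | 5   | mar   | fri | 86.2 | 26.2  | 94.3  | 5.1  | 8.2  | 51  | 6.7  | 0.0  | 0.00  |
| 1   | 7   | 4   | oct   | tue | 90.6 | 35.4  | 669.1 | 6.7  | 18.0 | 33  | 0.9  | 0.0  | 0.00  |
| 2   | 7   | 4   | oct   | sat | 90.6 | 43.7  | 686.9 | 6.7  | 14.6 | 33  | 1.3  | 0.0  | 0.00  |
| 3   | 8   | 6   | mar   | fri | 91.7 | 33.3  | 77.5  | 9.0  | 8.3  | 97  | 4.0  | 0.2  | 0.00  |
| 4   | 8   | 6   | mar   | sun | 89.3 | 51.3  | 102.2 | 9.6  | 11.4 | 99  | 1.8  | 0.0  | 0.00  |
| ... | ... | ... | ...   | ... | ...  | ...   | ...   | ...  | ...  | ... | ...  | ...  | ...   |
| 512 | 4   | 3   | aug   | sun | 81.6 | 56.7  | 665.6 | 1.9  | 27.8 | 32  | 2.7  | 0.0  | 6.44  |
| 513 | 2   | 4   | aug   | sun | 81.6 | 56.7  | 665.6 | 1.9  | 21.9 | 71  | 5.8  | 0.0  | 54.29 |
| 514 | 7   | 4   | aug   | sun | 81.6 | 56.7  | 665.6 | 1.9  | 21.2 | 70  | 6.7  | 0.0  | 11.16 |
| 515 | 1   | 4   | aug   | sat | 94.4 | 146.0 | 614.7 | 11.3 | 25.6 | 42  | 4.0  | 0.0  | 0.00  |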


You can also read files from your hard drive in pandas by using the path to the file as the first argument of `pd.read_csv`.

Finally, you can write files from the data in a `DataFrame` using the `DataFrame.to_csv()` method. For example, here's how we would save `ff` to a new file in Jupyter notebook:

In [None]:
ff.to_csv("ff.csv")

In Colab, you can download this file to your hard drive by clicking on the "Files" button to the left.
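
One detail worth knowing: by default, `to_csv` also writes the row labels (the index) as an extra first column, which then shows up as a column named `Unnamed: 0` when the file is read back. Here's a minimal round-trip sketch that avoids this with `index=False`. The table `small` and the file name `small.csv` are made up for this example, and the file is written to a temporary folder so nothing is left behind:

```python
import os
import tempfile

import pandas as pd

small = pd.DataFrame({'A': [1, 2, 3], 'B': ['four', 'five', 'six']})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small.csv")
    small.to_csv(path, index=False)  # index=False skips the row labels
    loaded = pd.read_csv(path)       # read the file back from disk

print(loaded)
```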

# Anonymous feedback

If you have any feedback for us, please let us know! The feedback form is completely anonymous, and we promise we'll take your suggestions into account for future presentations: https://forms.gle/C12vK71RJK6CraZv5

# References

Throughout the quarter, we will mainly be drawing our material from the following sources. Most of your learning will be done through trial and error, so we strongly encourage you to experiment by running code that you write from scratch!

For basic Python:
* The Python Tutorial: https://docs.python.org/3/tutorial/
* Basics of Python 3: https://www.learnpython.org/
* Codecademy Python 3 Course: https://www.codecademy.com/learn/learn-python-3

And for the rest of the quarter:
* Introducing Data Science: http://bedford-computing.co.uk/learning/wp-content/uploads/2016/09/introducing-data-science-machine-learning-python.pdf 
* Python for Data Analysis: http://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf 
* Pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html 
* Sklearn user guide: https://scikit-learn.org/stable/user_guide.html 