# Machine Learning - introductory project - based on Programming with Mosh course

>Author: **Andrzej Kocielski**  
Github: [andkoc001](https://github.com/andkoc001/)  
Email: and.koc001@gmail.com

Created: 24-12-2019

This notebook should be read in conjunction with the corresponding README.md file in the project [repository](https://github.com/andkoc001/) on GitHub.

___

## Introduction - project background

### Problem statement and project objectives

Purpose: _to be updated..._

#### Steps of a ML project
Quoted verbatim from the _Code with Mosh_ tutorial course.

1. Import the data
2. Clean the data
3. Split the data into training / test sets
4. Create a model
5. Train the model
6. Make predictions
7. Evaluate and improve
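
The seven steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the source does not name a modelling library, so scikit-learn and its `DecisionTreeRegressor` are assumed here, and the tiny table with the hypothetical columns `hours` and `score` is made up so that the example runs without any external file.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# 1. Import the data (made-up toy table; a real project would call pd.read_csv)
raw = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, None, 10.0],
    "score": [12.0, 19.0, 33.0, 41.0, 48.0, 62.0, 69.0, 83.0, 90.0, 99.0],
})

# 2. Clean the data (here: drop the row with a missing input value)
clean = raw.dropna()

# 3. Split the data into training / test sets
X = clean[["hours"]]
y = clean["score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# 4. Create a model (assumed choice, not prescribed by the source)
model = DecisionTreeRegressor(random_state=0)

# 5. Train the model
model.fit(X_train, y_train)

# 6. Make predictions
predictions = model.predict(X_test)

# 7. Evaluate (improving would mean iterating on steps 2 to 5)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error on the test set: {mae:.2f}")
```

With so few rows the error value itself means little; the point is the order of the steps and that the test rows are never shown to the model during training.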

___
## Project dataset

The dataset used in this project is the **video game sales** dataset, available from the kaggle.com website. For convenience, it is also included in this repository as `vgsales.csv`.

### Python environment setup 

#### Importing Python libraries

The following Python libraries are used in this project.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import scipy.stats as stats
# The command below displays plots inside the notebook rather than in a separate window.
%matplotlib inline

#### Load the dataset

The dataset is loaded from the `vgsales.csv` file and assigned to a DataFrame named `df`.

In [2]:
df = pd.read_csv("vgsales.csv")

### Basic information about the dataset

In [3]:
df.shape

(16598, 11)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
Rank            16598 non-null int64
Name            16598 non-null object
Platform        16598 non-null object
Year            16327 non-null float64
Genre           16598 non-null object
Publisher       16540 non-null object
NA_Sales        16598 non-null float64
EU_Sales        16598 non-null float64
JP_Sales        16598 non-null float64
Other_Sales     16598 non-null float64
Global_Sales    16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
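
The `df.info()` output above shows that `Year` and `Publisher` have fewer non-null entries than the 16598 rows of the table, so both columns contain missing values. The short sketch below works out how many, using only the counts printed above; the variable names are illustrative. On the loaded DataFrame, `df.isna().sum()` should report the same counts directly.

```python
import pandas as pd

# Row count and non-null counts copied from the df.info() output
total_rows = 16598
non_null = pd.Series({"Year": 16327, "Publisher": 16540})

# Missing entries per column, as a count and as a share of all rows
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary)
```

This gives 271 missing `Year` values (about 1.63% of rows) and 58 missing `Publisher` values (about 0.35%), which is worth keeping in mind for the data cleaning step.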


In [31]:
df.describe(include="all")

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16598.0 | 16598 | 16598 | 16327.0 | 16598 | 16540 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| unique | | 11493 | 31 | | 12 | 578 | | | | | |
| top | | Need for Speed: Most Wanted | DS | | Action | Electronic Arts | | | | | |
| freq | | 12 | 2163 | | 3316 | 1351 | | | | | |
| mean | 8300.605254 | | | 2006.406443 | | | 0.264667 | 0.146652 | 0.077782 | 0.048063 | 0.537441 |
| std | 4791.853933 | | | 5.828981 | | | 0.816683 | 0.505351 | 0.309291 | 0.188588 | 1.555028 |
| min | 1.0 | | | 1980.0 | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4151.25 | | | 2003.0 | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8300.5 | | | 2007.0 | | | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12449.75 | | | 2010.0 | | | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |


In [38]:
# more information on a sample categorical attribute
print(df["Genre"].describe())
print(df["Genre"].unique())

count      16598
unique        12
top       Action
freq        3316
Name: Genre, dtype: object
['Sports' 'Platform' 'Racing' 'Role-Playing' 'Puzzle' 'Misc' 'Shooter'
 'Simulation' 'Action' 'Fighting' 'Adventure' 'Strategy']


In [6]:
df.head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


In [7]:
df.values

array([[1, 'Wii Sports', 'Wii', ..., 3.77, 8.46, 82.74],
       [2, 'Super Mario Bros.', 'NES', ..., 6.81, 0.77, 40.24],
       [3, 'Mario Kart Wii', 'Wii', ..., 3.79, 3.31, 35.82],
       ...,
       [16598, 'SCORE International Baja 1000: The Official Game', 'PS2',
        ..., 0.0, 0.0, 0.01],
       [16599, 'Know How 2', 'DS', ..., 0.0, 0.0, 0.01],
       [16600, 'Spirits & Spells', 'GBA', ..., 0.0, 0.0, 0.01]],
      dtype=object)

### Project briefs

First, I am going to simplify the dataset by removing the columns that I consider redundant for this project: Name, NA_Sales, EU_Sales, JP_Sales and Other_Sales.
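
A minimal sketch of this simplification with `DataFrame.drop` is given below. To keep it runnable without `vgsales.csv`, it uses the first two rows of the `df.head()` output as a stand-in; the names `sample`, `redundant` and `reduced` are illustrative. On the full dataset the same call would be applied to `df`.

```python
import pandas as pd

# First two rows of the df.head() output, used as a stand-in for the full dataset
sample = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["Wii Sports", "Super Mario Bros."],
    "Platform": ["Wii", "NES"],
    "Year": [2006.0, 1985.0],
    "Genre": ["Sports", "Platform"],
    "Publisher": ["Nintendo", "Nintendo"],
    "NA_Sales": [41.49, 29.08],
    "EU_Sales": [29.02, 3.58],
    "JP_Sales": [3.77, 6.81],
    "Other_Sales": [8.46, 0.77],
    "Global_Sales": [82.74, 40.24],
})

# Columns considered redundant for this project
redundant = ["Name", "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# drop() returns a new DataFrame; the original stays unchanged
reduced = sample.drop(columns=redundant)
print(reduced.columns.tolist())
# ['Rank', 'Platform', 'Year', 'Genre', 'Publisher', 'Global_Sales']
```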

The project will be divided into two phases: 1) Data analysis, 2) Machine Learning.

1) In the first phase of the project, I will examine several attributes of the dataset (Platform, Year, Genre and Publisher) to see whether any of them is related to Global_Sales. Candidate techniques: regression and classification.

2) In the second phase of the project, I am going to build a model that predicts Global_Sales (the output variable, also called the target) from the input variables. The inputs are the same attributes as in phase 1, excluding the target.
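
The two phases could start along the lines of the sketch below. It is an assumption-laden illustration, not the method used later in the project: it relies on the five rows shown by `df.head()` (the top sellers, so not representative of the whole dataset), a group-wise mean is just one simple way to look for a relationship, and one-hot encoding with `pd.get_dummies` is an assumed way of turning the categorical inputs into numbers.

```python
import pandas as pd

# The five rows shown by df.head(), reduced to the attributes of interest
sample = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "Wii", "GB"],
    "Year": [2006.0, 1985.0, 2008.0, 2009.0, 1996.0],
    "Genre": ["Sports", "Platform", "Racing", "Sports", "Role-Playing"],
    "Publisher": ["Nintendo"] * 5,
    "Global_Sales": [82.74, 40.24, 35.82, 33.0, 31.37],
})

# Phase 1: compare the target across the levels of one categorical attribute
mean_sales_by_platform = sample.groupby("Platform")["Global_Sales"].mean()
print(mean_sales_by_platform)

# Phase 2: separate the inputs from the target ...
y = sample["Global_Sales"]
inputs = sample.drop(columns="Global_Sales")

# ... and one-hot encode the categorical inputs; Year stays numeric
X = pd.get_dummies(inputs, columns=["Platform", "Genre", "Publisher"])
print(X.columns.tolist())
```

Keeping `X` and `y` separate from the start makes it harder to leak the target into the inputs by accident when the model is trained.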

___
## References and bibliography 

### Project related

- Hamedani, M., The Complete Python Course. [online] Available at: <https://codewithmosh.com/courses/> [Accessed December 2019].


### Numerical tools

- SciPy - Reference Guide. [online] Available at: <https://docs.scipy.org/doc/scipy/reference/> [Accessed December 2019].
- NumPy - Documentation. [online] Available at: <https://numpy.org/doc/> [Accessed December 2019].
- Pandas - Documentation. [online] Available at: <https://pandas.pydata.org/pandas-docs/stable/> [Accessed November 2019].
- Random sampling (numpy.random) - NumPy v1.16 Manual. [online] Available at: <https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html> [Accessed November 2019].
- A truncated normal continuous random variable (scipy.stats.truncnorm) - SciPy v1.3.3 Reference Guide. [online] Available at: <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html> [Accessed December 2019].

___
Andrzej Kocielski