### Importing Standard Libraries

In [None]:
import numpy as np  # Library for array processing, linear algebra
import pandas as pd  # Library for data processing, data manipulation
import matplotlib.pyplot as plt  # Library for data visualisation
import seaborn as sns  # Library for different plots

from sklearn.model_selection import train_test_split  # To split data into training and validation data
from sklearn.metrics import mean_squared_error  # Evaluation metric

sns.set(style="whitegrid", color_codes=True) 
sns.set(font_scale=1)

from IPython.display import display 
pd.options.display.max_columns = None  # To display all columns in the notebook

# Displaying graphs in the notebook itself
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')  # Doesn't display warnings

### Loading Dataset

In [None]:
# Read and store the data in a dataframe 'data' to be used for further processing (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Display first five rows of the dataset (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Similarly, data.tail() shows the last five rows of the data

# Display the last five rows of the data (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Dimensions of the data (Number of rows, Number of columns) (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Print all the columns/features in the data (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

#### Length of the dataset

In [None]:
# Length of dataset (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

## Understanding Pandas DataFrame
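
As a minimal sketch of the access patterns practised in this section, the example below builds a tiny made-up frame. Only the column names are borrowed from the dataset; the values and the name `toy` are invented for illustration. Note that label-based slicing with `.loc` includes the end label, whereas position-based slicing with `.iloc` excludes the end position.

```python
import pandas as pd

# Tiny made-up frame; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "party_size": [1, 2, 4, 2, 1, 4],
    "player_kills": [0, 3, 1, 0, 2, 5],
    "player_survive_time": [120.5, 900.0, 340.2, 75.0, 1500.3, 610.8],
})

one_column = toy["player_survive_time"]            # single column -> Series
two_columns = toy[["party_size", "player_kills"]]  # list of columns -> DataFrame
rows_by_label = toy.loc[3:5]      # labels 3, 4 and 5 (end label included)
rows_by_position = toy.iloc[3:5]  # positions 3 and 4 (end position excluded)

print(two_columns.shape, len(rows_by_label), len(rows_by_position))
```

Which of the two slicing styles is appropriate depends on whether "index 3 to 5" is meant as labels or as positions.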

In [None]:
# Access the column player_survive_time (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Access multiple columns, party_size and player_kills (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Access multiple rows, index 3 to 5 (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

## Dealing with the 'date' feature
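
The following sketch runs the same kind of parsing on two made-up timestamps, so the effect of each step can be checked by hand. It uses the `.dt` accessor, which is an alternative to wrapping the column in `pd.DatetimeIndex`. The frame `sample` and the extra column `DayName` are hypothetical and are not part of the notebook's dataset.

```python
import pandas as pd

# Made-up timestamps laid out like the format string used in this section
sample = pd.DataFrame({"date": ["2017-10-20T23:15:30+0000",
                                "2017-10-22T08:05:00+0000"]})
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%dT%H:%M:%S+0000")

sample["Day"] = sample["date"].dt.weekday  # Monday = 0 ... Sunday = 6
sample["Hour"] = sample["date"].dt.hour    # Hour of day, 0 to 23

# Optional: translate weekday numbers into names (hypothetical extra column)
weekday_map = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
               4: "Friday", 5: "Saturday", 6: "Sunday"}
sample["DayName"] = sample["Day"].map(weekday_map)

print(sample)
```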

In [None]:
# To change the date format
data['date'] =  pd.to_datetime(data['date'], format='%Y-%m-%dT%H:%M:%S+0000')

In [None]:
# Extracting the weekday from date
data['Day'] = pd.DatetimeIndex(data['date']).weekday
weekday_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

In [None]:
# Extracting hour from time
# Creating new variable hour from the time variable 
data['Hour'] = pd.DatetimeIndex(data['date']).hour

In [None]:
# Display first three rows of the data
data.head(3)

### Getting Rid of Redundant Variables

In [None]:
del data['date']  # As we have already extracted the useful info i.e. Weekday and Hour
del data['match_mode']  # Because all the matches were played in TPP (Third-Person Perspective) mode
del data['team_id']  # Because we already have match_id and player_name to uniquely identify an instance

## Steps
*  Problem Identification 
*  Hypothesis Generation
*  Variable Identification
*  Univariate Analysis
*  Bivariate Analysis
*  Missing Values
*  Outliers
*  Feature Engineering/Variable Transformation
*  Predictive Modeling
*  Analysing the Model
*  Final Model Selection

## Variable Identification & their datatypes
Identify the predictor and target variables & their data types along with the category of variables

In [None]:
# Determining data types of the variable (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

#### In pandas, numeric columns usually have the dtype "int32", "int64", "float32" or "float64", whereas text columns usually appear as "object"
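
A small sketch of how dtypes can be inspected and used to separate numeric from non-numeric columns. The frame `toy` is made up, and `player_name` is only assumed to be a text column. Depending on the pandas version, text columns may be reported as `object` or as a dedicated string dtype, so the example selects them by excluding numeric columns.

```python
import pandas as pd

# Made-up frame with one integer, one float and one text column
toy = pd.DataFrame({
    "player_kills": [0, 3, 1],
    "player_dmg": [0.0, 312.5, 98.0],
    "player_name": ["a", "b", "c"],
})

print(toy.dtypes)  # One dtype per column

# Split the columns by kind instead of hard-coding dtype names
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```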

## Univariate Analysis
Analysing the variables one at a time. Let's analyse continuous and categorical variables separately.

### For continuous variables: We generally measure the central tendency (mean, median, mode) and the spread (standard deviation, variance, etc.) of the variable.
* Basic Statistics
* Plotting Histogram
* Plotting Boxplot
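
As a rough illustration of these statistics, the sketch below computes them on a short made-up series of kill counts (the values are invented). A single large value pulls the mean well above the median, which is one reason to look at both.

```python
import pandas as pd

# Made-up kill counts with one unusually large value
kills = pd.Series([0, 0, 1, 1, 1, 2, 3, 9], name="player_kills")

summary = {
    "mean": kills.mean(),         # Sensitive to the large value
    "median": kills.median(),     # Robust to the large value
    "mode": kills.mode().iloc[0], # Most frequent value
    "std": kills.std(),           # Sample standard deviation (ddof=1)
    "var": kills.var(),           # Sample variance (ddof=1)
}
print(summary)
```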

In [None]:
# Continuous variable analysis (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Plot given numerical variable with respect to other variables
cont_vars = ['player_dbno', 'player_dist_walk', 'player_dmg', 'player_kills']
sns.pairplot(data[cont_vars])

In [None]:
# Plotting histogram for 'player_kills' variable (1 line of code)

### START CODE HERE ###

    ### plot(data, arguments)
    
### END CODE HERE ###

plt.title("Distribution of Number of Kills")
plt.ylabel("Number of Occurrences")
plt.xlabel("Number of Kills");

In [None]:
# Frequency of each value in weekday column
weekday_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
dict(data.Day.value_counts())

In [None]:
# Bar chart and line plot for the 'Day' variable (counts are hard-coded)
week_data = {'Mon': 14155, 'Tue': 13860, 'Wed': 13183, 'Thu': 11611, 'Fri': 14458, 'Sat': 16443, 'Sun': 16290}
names = list(week_data.keys())
values = list(week_data.values())

fig, axs = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
axs[0].bar(names, values)
axs[1].plot(names, values)
fig.suptitle('Categorical Plotting')

In [None]:
# Boxplot of the variable game_size (1 line of code)

### START CODE HERE ###
    
    ###plot(data, arguments)
    
### END CODE HERE ###

plt.title("Distribution of game_size")
plt.xlabel("Number of Teams in Game");

### For categorical variables: We generally measure the frequency of categories appearing in a particular categorical variable
* Count/Frequency Table
* Plotting Stacked Bar Graph
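
A minimal sketch of a frequency table on made-up data. The series `party` and its values are invented; only the column name is borrowed from the dataset.

```python
import pandas as pd

# Made-up party sizes
party = pd.Series([1, 2, 2, 4, 4, 4, 2, 1, 4, 4], name="party_size")

counts = party.value_counts()                # Absolute frequency, most common first
shares = party.value_counts(normalize=True)  # Relative frequency (sums to 1)
n_unique = party.nunique()                   # Number of distinct categories

print(counts)
print(shares)
print(n_unique)
```

The index of `counts` holds the category labels in the same order as the counts, which is convenient when labelling plots.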

In [None]:
# Selecting categorical variables from the data

categorical_variables = ['party_size', 'Day', 'Hour']

In [None]:
print(categorical_variables)

In [None]:
# Unique values count in each categorical variable (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Frequency count of each categorical variable (4 lines of code)

### START CODE HERE ###




### END CODE HERE ###

In [None]:
# Display party_size's distribution using pie-chart

counts = data['party_size'].value_counts()  # One call keeps labels and sizes in the same order
sizes = counts.values
explode = [0.1, 0, 0]
percent = 100. * sizes / sizes.sum()
labels = ['{0} - {1:1.1f} %'.format(i, j) for i, j in zip(counts.index, percent)]

colors = ['yellowgreen', 'gold', 'lightblue']
patches, texts= plt.pie(sizes, colors=colors,explode=explode,
                        shadow=True,startangle=90)
plt.legend(patches, labels, loc="best")

plt.title("Party Size Classification")
plt.show()

## Bivariate Analysis
Bivariate analysis is used to find out the relationship between any 2 variables. It can be done for any combination of variables. The combinations are: 
* Continuous & Continuous
* Categorical & Continuous
* Categorical & Categorical

### Continuous & Continuous
Scatter Plots are used

In [None]:
### START CODE HERE ###

# Scatterplot between hit points and DBNOs (1 line of code)

# Display title above the plot (1 line of code)

# Label y-axis (1 line of code)

# Label x-axis (1 line of code)

### END CODE HERE ###

In [None]:
# Correlation heatmap between variables 

corrMatrix = data[["game_size", "player_assists", "player_dbno",
                   "player_dist_ride", "player_dist_walk", "player_dmg",
                   "player_survive_time", "team_placement", "player_kills"]].corr()

sns.set(font_scale=1.10)
plt.figure(figsize=(9, 9))

sns.heatmap(corrMatrix, vmax=.8, linewidths=0.01,
            square=True,annot=True,cmap='viridis',linecolor="white")
plt.title('Correlation between features');

#### +1: perfect positive correlation; -1: perfect negative correlation; 0: no linear correlation
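
To make the three reference values concrete, here is a small sketch on made-up columns (the names `double_x`, `minus_x` and `centred_sq` are invented for illustration). The last column depends entirely on `x`, yet its Pearson correlation with `x` is about zero, which shows that a value near 0 only rules out a linear relationship.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)  # 1.0, 2.0, ..., 10.0

toy = pd.DataFrame({
    "x": x,
    "double_x": 2 * x + 3,              # Increasing linear function of x
    "minus_x": -x,                      # Decreasing linear function of x
    "centred_sq": (x - x.mean()) ** 2,  # Symmetric around the mean of x
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(3))
```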

### Categorical & Continuous
Boxplots can be used

In [None]:
# sns.boxplot between Team Size and Survival Time (4 lines of code)

### START CODE HERE ###




### END CODE HERE ###

### Categorical and categorical
Cross-tabulations and stacked bar plots are used

In [None]:
crosstable = pd.crosstab(data.Day, data.party_size)

In [None]:
crosstable

In [None]:
# Plotting stacked bar plot (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

## Missing Values

In [None]:
# Detecting missing values (1 line of code)

### START CODE HERE ###

### END CODE HERE ###


### Treating missing values:
* For continuous variables impute with mean
* For categorical variables impute with mode
* For better results, predict the missing values of a variable by treating it as the target variable of a model trained on the remaining variables
* If only a few values are missing, we can delete the observations that contain them
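
The simpler strategies from the list above can be sketched as follows. The frame `toy` and its values are made up; only the column names come from the dataset. The model-based strategy is not shown.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column
toy = pd.DataFrame({
    "player_dmg": [100.0, np.nan, 300.0, 200.0],  # Treated as continuous
    "party_size": [2, 4, np.nan, 4],              # Treated as categorical
})

missing_per_column = toy.isnull().sum()  # Count of missing values per column
print(missing_per_column)

filled = toy.copy()
# Continuous variable: impute with the mean of the observed values
filled["player_dmg"] = filled["player_dmg"].fillna(filled["player_dmg"].mean())
# Categorical variable: impute with the most frequent observed value
filled["party_size"] = filled["party_size"].fillna(filled["party_size"].mode().iloc[0])

# Alternative when few rows are affected: drop incomplete observations
dropped = toy.dropna()
print(filled)
print(len(dropped))
```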


## Outliers
Outliers are data points that show unusual behaviour or lie far away from the overall trend.
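
One common convention for flagging outliers, and the one boxplot whiskers are usually based on, is the 1.5 × IQR rule. The notebook does not state which rule it intends, so the sketch below is an assumption, shown on made-up survival times. It replaces flagged values with the median in a vectorised way instead of using `apply`.

```python
import pandas as pd

# Made-up survival times with one extreme value
times = pd.Series([100, 120, 130, 150, 160, 170, 180, 2000], dtype=float)

q1, q3 = times.quantile(0.25), times.quantile(0.75)
iqr = q3 - q1                  # Interquartile range
lower = q1 - 1.5 * iqr         # Values below this are flagged
upper = q3 + 1.5 * iqr         # Values above this are flagged

is_outlier = (times < lower) | (times > upper)

# Keep values inside the range, replace the rest with the median
cleaned = times.where(~is_outlier, times.median())
print(lower, upper)
print(cleaned.tolist())
```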

In [None]:
# Boxplot of the variable survival time
sns.boxplot(x="player_survive_time", data=data, showfliers=True)  # Keyword x= is required by recent seaborn versions
plt.title("Distribution of Survival Time")
plt.xlabel("Survival Time");

In [None]:
# Treating outliers (5 lines of code)

### START CODE HERE ###





### END CODE HERE ###

In [None]:
# Range lower_value and upper_value
lower_value, upper_value

In [None]:
# Replace a value with the median if it lies outside the range given above
# Note: Take care of the indentation

def outlier_imputer(x):  # (4 lines of code)
    """
    
    
    
    
    """

In [None]:
result = data['player_survive_time'].apply(outlier_imputer)  # This may take a little while to run

In [None]:
# Draw a labeled boxplot of variable "result" (3 lines of code)

### START CODE HERE ###



### END CODE HERE ###

# Building the First Model

#### After tightening the seat belt, it's time to take off

In [None]:
# Dependent_variable -> which we are going to predict
# Independent_variable -> helps in predicting dependent_variable

dependent_variable = 'player_kills'
independent_variable = ['game_size', 'party_size', 'player_assists', 'player_dbno', 'player_dist_ride', 'Hour', 
                        'player_dist_walk', 'player_dmg', 'player_survive_time', 'team_placement', 'Day']

In [None]:
independent_variable

### Splitting our data into training and testing (validation) data

In [None]:
# To split data into training and testing sets
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and testing sets (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
train.head()

In [None]:
print(len(data))
print(len(train))
print(len(test))

In [None]:
# Baseline: predict the mean number of kills, rounded to the nearest integer
np.round(train['player_kills'].mean())  # train['player_kills'].mean() = 0.887

In [None]:
test['prediction'] = 1.0

In [None]:
test.head()

In [None]:
# Analysing the prediction
from sklearn.metrics import mean_squared_error

In [None]:
RMSE = np.sqrt(mean_squared_error(test['prediction'], test[dependent_variable]))
np.round(RMSE, 3)  # RMSE = 1.616

# Building Machine Learning Model

### Using Linear Regression Algorithm

In [None]:
# Importing machine learning library
from sklearn.linear_model import LinearRegression

In [None]:
# Creating machine learning model
model1 = LinearRegression()

In [None]:
# Training our model
model1.fit(train[independent_variable], train[dependent_variable])

In [None]:
# Get coefficients
model1.coef_

In [None]:
# Get intercept
model1.intercept_

In [None]:
# Predicting on test data
prediction = model1.predict(test[independent_variable])

#### Analysing our model
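
The models in this notebook are scored with the root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, which is what `np.sqrt(mean_squared_error(...))` computes. It is an error measure in the units of the target, so lower is better. The sketch below works it out by hand on made-up numbers with a constant prediction.

```python
import numpy as np

# Made-up targets and a constant prediction of 1.0
y_true = np.array([0.0, 1.0, 2.0, 5.0])
y_pred = np.array([1.0, 1.0, 1.0, 1.0])

squared_errors = (y_true - y_pred) ** 2  # 1, 0, 1, 16
mse = squared_errors.mean()              # 18 / 4 = 4.5
rmse = np.sqrt(mse)                      # Square root of 4.5

print(mse, rmse)
```

Because the errors are squared before averaging, the single large miss dominates the result.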

In [None]:
# RMSE on training dataset (lower is better)
np.sqrt(mean_squared_error(model1.predict(train[independent_variable]), train[dependent_variable]))

In [None]:
# RMSE on testing dataset (lower is better)
np.sqrt(mean_squared_error(model1.predict(test[independent_variable]), test[dependent_variable]))

### Using Decision Tree Algorithm

In [None]:
# Importing Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
model2 = DecisionTreeRegressor()

In [None]:
# Training our decision tree model (model2) (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# Get Predictions for model2 (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# RMSE on testing dataset (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

### Using LightGBM

In [None]:
# !pip install lightgbm

In [None]:
# Importing LightGBM Regressor
import lightgbm
from lightgbm import LGBMRegressor

In [None]:
model3 = LGBMRegressor()

In [None]:
# Training our LightGBM model (model3) (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
ax = lightgbm.plot_importance(model3)
fig = ax.figure
fig.set_size_inches(8, 8)

In [None]:
# Get Predictions for model3 (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

In [None]:
# RMSE on testing dataset (1 line of code)

### START CODE HERE ###

### END CODE HERE ###

## What You Can Try Next on Your Own

We saw that LightGBM outperformed Linear Regression and Decision Trees by a small margin and clearly surpassed our baseline model. However, a few more things can be tried to lower the RMSE further:

* Hyperparameter tuning, for example with Hyperopt.
* Better feature generation.
* Trying ensembles of different models.
* Better feature transformations.
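
As a starting point for the first item, here is a minimal sketch of a hyperparameter search. It uses scikit-learn's `GridSearchCV` rather than Hyperopt, runs on synthetic data rather than the notebook's dataset, and the grid values are arbitrary assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

# Arbitrary illustrative grid, not tuned for any real dataset
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=5,
)
search.fit(X, y)

best_rmse = -search.best_score_  # Flip the sign back to a plain RMSE
print(search.best_params_, round(best_rmse, 3))
```

The same pattern applies to other estimators with a scikit-learn interface, with the grid adapted to that estimator's parameters.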

## Where to Go from Here

Here are some resources and blogs that would help one to get started in Data Science and Machine Learning:

* __[DSG Blog about How to Start Data Science](https://medium.com/data-science-group-iitr/stop-thinking-start-learning-cb74629bca3a)__
* __[DSG Medium Handle](https://medium.com/data-science-group-iitr)__
* __[3 Blue 1 Brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)__
* __[Harvard Data Science Course (CS109)](http://cs109.github.io/2015/pages/videos.html)__
* __[Andrew Ng Machine Learning Course](http://cs229.stanford.edu/)__
* __[Analytics Vidhya](https://www.analyticsvidhya.com/blog/)__
* __[Machine Learning Mastery](https://machinelearningmastery.com/)__
* __[Kaggle (A Competitive Data Science Platform)](https://www.kaggle.com/)__