# Task -  IPL Powerplay score prediction

<b> Rutvik Solanki </b> 

<b> Yagnik Bhargav </b>

<b> Zainuddin Saiyed </b>

## Objective :

Our aim is to predict the the total runs in a power play of an cricket match such as IPL, based on a set of given features such as cricketers and bowlers - name, their performance and statistics in previous games, and other features.


### Dataset - IPL  runs dataset 
The first row of each CSV file contains the headers for the file, with each
subsequent row providing details on a single delivery. The headers in the
file are:

  * match_id
  * season
  * start_date
  * venue
  * innings
  * ball
  * batting_team
  * bowling_team
  * striker
  * non_striker
  * bowler
  * runs_off_bat
  * extras
  * wides
  * noballs
  * byes
  * legbyes
  * penalty
  * wicket_type
  * player_dismissed
  * other_wicket_type
  * other_player_dismissed

#### Loading the necessary packages

In [9]:
import pymongo
import pandas as pd
import json
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

<b>PyMongo</b> is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python

In the below code section we try to fetch the required and necessary bowler and batsmen information for the corresponding power game play that is being played in the cricket match. Also we perform use padding that add extra zeros in cases where the data is less

In [10]:
def pad(A, length):
    arr = np.zeros(length)
    arr[:len(A)] = A
    return arr


def get_bowl_detail(d,df):
    m = []
    for i, x in enumerate(d):
        l = np.array([])
        bowlers = x.split(',')
        for i in bowlers:
            l = np.append(l, df[df['bowler'] == i][['noballs', 'total_runs_mean', 'total_runs', 'wides']].to_numpy())
            # l=np.append(l,df[df['bowler']==i][['total_runs_mean']].to_numpy())
        l = pad(l, 44)
        m.append(l)
    return np.array(m)


def get_bat_detail(d,df):
    m = []
    for i, x in enumerate(d):
        l = np.array([])
        bowlers = x.split(',')
        for i in bowlers:
            l = np.append(l, df[df['striker'] == i][['four', 'six', 'total_runs_mean', 'total_runs']].to_numpy())
            # l=np.append(l,df[df['striker']==i][['total_runs_mean']].to_numpy())
        l = pad(l, 44)
        m.append(l)
    return np.array(m)

### Train model
We use a linear regression model to train the IPL dataset where we use features like venue , innings, batting team, bowling team, wickets, bowlers, bowler stats and batsmen statistics.

The data is divided into train and test samples with a split of 75% , 25% i.e. 75% data goes into training while 25% data is used for testing.

Later we find the Mean squared error (MSE) to evaluate the model performance.

In [11]:
def train_model(data,bat,bowl):
    venue_encoder = LabelEncoder()
    team_encoder = LabelEncoder()
    data['venue'] = venue_encoder.fit_transform(data['venue'])
    data['batting_team'] = team_encoder.fit_transform(data['batting_team'])
    data['bowling_team'] = team_encoder.fit_transform(data['bowling_team'])
    data = data[['venue', 'innings', 'batting_team', 'bowling_team', 'wickets', 'bowlers', 'bow', 'bat','total_runs']]
    a = data.to_numpy()
    X, y = a[:, :5], a[:, 8]
    ba = get_bat_detail(data['bat'],bat)
    bo = get_bowl_detail(data['bow'],bowl)
    X = np.concatenate((np.eye(42)[a[:, 0].astype(int)],
                        np.eye(2)[a[:, 1].astype(int) - 1],
                        np.eye(15)[a[:, 2].astype(int)],
                        np.eye(15)[a[:, 3].astype(int)],
                        a[:, 4].astype(int).reshape(-1, 1),
                        a[:, 5].astype(int).reshape(-1, 1),
                        ba, bo
                        ), axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    print('Model trained successfully')
    y_pred = lr.predict(X_test)
    mse=mean_squared_error(y_test, np.round(y_pred))
    print("Mean squared error is ",mse)
    print('Saving model')
    joblib.dump(lr, 'lr.joblib')
    print('Model saved')
    print('Saving encoder representations')
    joblib.dump(venue_encoder, 'venue_encoder.joblib')
    joblib.dump(team_encoder, 'team_encoder.joblib')
    print('Saved encoder representations')
    return 0


We save the trained model using joblib package in python . We also save the Label encoder object on file so that can be use directly for prediction 

### Predict on the model
We firslty load the model from the files and then perform some computation (Label encoder, load  batsman stats and bowler stats) and then use the model.predict() method to predict on the model.

In [12]:
def predictRuns(testInput,bat,bowl):
    prediction = 0
    with open('lr.joblib', 'rb') as f:
        lr = joblib.load(f)
    with open('venue_encoder.joblib', 'rb') as f:
        venue_encoder = joblib.load(f)
    with open('team_encoder.joblib', 'rb') as f:
        team_encoder = joblib.load(f)
    data=testInput
    data['venue'] = venue_encoder.fit_transform(data['venue'])
    data['batting_team'] = team_encoder.fit_transform(data['batting_team'])
    data['bowling_team'] = team_encoder.fit_transform(data['bowling_team'])
    bats=data['batsmen'].values[0].split(',')
    bowls = data['bowlers'].values[0].split(',')
    data['wickets']=len(bats)
    data['bowls'] = len(bowls)
    data=data[['venue','innings','batting_team','bowling_team','wickets','bowls','batsmen','bowlers']]
    a = data.to_numpy()
    ba = get_bat_detail(data['batsmen'],bat)
    bo = get_bowl_detail(data['bowlers'],bowl)
    X = np.concatenate((np.eye(42)[a[:, 0].astype(int)],
                        np.eye(2)[a[:, 1].astype(int) - 1],
                        np.eye(15)[a[:, 2].astype(int)],
                        np.eye(15)[a[:, 3].astype(int)],
                        a[:, 4].astype(int).reshape(-1, 1),
                        a[:, 5].astype(int).reshape(-1, 1),
                        ba, bo
                        ), axis=1)
    pred=lr.predict(X)
    prediction=np.round(pred)
    return int(prediction[0])

### Create Mongo DB Database

In [13]:
def create_database():
    myclient = pymongo.MongoClient("mongodb+srv://username:password@cluster0.t1s75.mongodb.net/ipl?retryWrites=true&w=majority")
    mydb = myclient["ipl"]
    return mydb

### Convert csv to MongoDB collection

In [14]:
def convert_py_mongob(mydb,csv_path,col_name):
    col=mydb[col_name]
    df=pd.read_csv(csv_path)
    records = json.loads(df.T.to_json()).values()
    x=col.insert_many(records)
    x=col.find_one()
    print('Dataset inserted in mongodb database as collection')
    print(x)

### Store dataset on MongoDB cluster

In [18]:

def create_db(mydb):
    convert_py_mongob(mydb,'processed.csv','all_ipl')
    convert_py_mongob(mydb, 'batsman.csv', 'batsman')
    convert_py_mongob(mydb, 'bowler.csv', 'bowler')
    convert_py_mongob(mydb, 'test/sampleinput_files/Apr-29-inn1-match1.csv', 'test1')
    convert_py_mongob(mydb, 'test/sampleinput_files/Apr-29-inn2-match1.csv', 'test2')
    convert_py_mongob(mydb, 'test/sampleinput_files/Apr-29-inn1-match2.csv', 'test3')
    convert_py_mongob(mydb, 'test/sampleinput_files/Apr-29-inn2-match2.csv', 'test4')
    convert_py_mongob(mydb, 'test/sampleinput_files/Apr-30-inn1.csv', 'test5')

### Load MongoDB collection to dataframe

In [19]:
def load_dataset(mydb,col_name):
    df=pd.DataFrame(list(mydb[col_name].find()))
    return df

Create the database and load the dataset from MongoDB

In [20]:
mydb=create_database()
if (len(mydb.list_collection_names())==0):
    create_db(mydb)
data=load_dataset(mydb,'all_ipl')
print('Loaded ipl dataset')
bat=load_dataset(mydb,'batsman')
print('Loaded batsman stats dataset')
bowl=load_dataset(mydb,'bowler')
print('Loaded bowler stats dataset')

Loaded ipl dataset
Loaded batsman stats dataset
Loaded bowler stats dataset


### Printing the dataset values

In [21]:
print(data.head())

                        _id                                       venue  \
0  608a7103e7d4d50d4d11bc64                       M Chinnaswamy Stadium   
1  608a7103e7d4d50d4d11bc65                       M Chinnaswamy Stadium   
2  608a7103e7d4d50d4d11bc66  Punjab Cricket Association Stadium, Mohali   
3  608a7103e7d4d50d4d11bc67  Punjab Cricket Association Stadium, Mohali   
4  608a7103e7d4d50d4d11bc68                            Feroz Shah Kotla   

   innings                 batting_team                 bowling_team  wickets  \
0        1        Kolkata Knight Riders  Royal Challengers Bangalore        3   
1        2  Royal Challengers Bangalore        Kolkata Knight Riders        6   
2        1          Chennai Super Kings              Kings XI Punjab        3   
3        2              Kings XI Punjab          Chennai Super Kings        2   
4        1             Rajasthan Royals             Delhi Daredevils        4   

   bowlers                              bow  \
0        3     

In [22]:
print(bat.head())

                        _id         striker  four  six  total_runs_mean  \
0  608a7103e7d4d50d4d11c2c8  A Ashish Reddy     1    0         1.000000   
1  608a7103e7d4d50d4d11c2c9        A Chopra     3    0         0.657895   
2  608a7103e7d4d50d4d11c2ca      A Flintoff     2    1         1.500000   
3  608a7103e7d4d50d4d11c2cb        A Mishra     0    0         0.200000   
4  608a7103e7d4d50d4d11c2cc        A Mukund     1    0         0.937500   

   total_runs  
0           6  
1          25  
2          21  
3           1  
4          15  


## Main Function to predict and test the model :

In [24]:
res='y'
while res.lower()=='y':
    ques=input('Do you want to train model or predict on the trained model ? Type(train(t)/predict(p))')
    if ques.lower()=='t':
        x=train_model(data,bat,bowl)
        if x==0:
            print('Model training Successful')
    else :
        t=input('Enter test (1/2/3/4/5)')
        test = load_dataset(mydb, t)
        runs=predictRuns(test,bat,bowl)
        print('Predicted Runs ',runs)
    res=input('Do you want to try again (y/n)')

Do you want to train model or predict on the trained model ? Type(train(t)/predict(p))t
Model trained successfully
Mean squared error is  171.32273838630806
Saving model
Model saved
Saving encoder representations
Saved encoder representations
Model training Successful
Do you want to try again (y/n)y
Do you want to train model or predict on the trained model ? Type(train(t)/predict(p))p
Enter test (1/2/3/4/5)test1
Predicted Runs  30
Do you want to try again (y/n)y
Do you want to train model or predict on the trained model ? Type(train(t)/predict(p))p
Enter test (1/2/3/4/5)test2
Predicted Runs  49
Do you want to try again (y/n)y
Do you want to train model or predict on the trained model ? Type(train(t)/predict(p))p
Enter test (1/2/3/4/5)test3
Predicted Runs  38
Do you want to try again (y/n)n


## Conclusion:

We learned how to access the mongo DB , upload and load the data set to and from mongo DB cloud server. We also learned and performed data processing and loading after fetching from mongo db server in python and also trained a linear regression and to perform regression to predict the runs scored by the cricket team based on the batsmen, bowler and match stats and performance.