# T20i result prediction from first innings

This notebook explores the ball-by-ball data available from https://cricsheet.org/. The downloaded folder, containing all T20Is as of 18th May 2024, is saved in the same folder as this notebook.

## Contents
<a id='contents'></a>
- [Setup](#setup)
- [Data exploration](#explore)
- [Classification](#class)
- [Review](#review)

<a id='setup'></a>
## Setup
[Return to contents](#contents)
### Import packages:

In [1]:
import json
import numpy as np
import pandas as pd
import collections

# Models
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm



### Train-Validate-Test split:
The split should be based on time, since older matches are used to predict newer ones; this rules out standard (shuffled) cross-validation. Information about the individual matches is contained in the accompanying README file.
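If some form of cross-validation were still wanted, one time-respecting option is an expanding window, where each fold trains only on years before the validation year. This is a minimal sketch of that idea and is not used elsewhere in this notebook; the function name and `min_train_years` parameter are hypothetical.

```python
def expanding_window_folds(years, min_train_years=1):
    """Yield (train_years, valid_year) pairs that never train on the future."""
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        # Everything before position i is training data, year i is validation
        yield ordered[:i], ordered[i]

folds = list(expanding_window_folds([2020, 2021, 2022, 2023], min_train_years=2))
for train_years, valid_year in folds:
    print(train_years, '->', valid_year)
```

Each fold has more training data than the one before, so early folds can give noisier scores.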

In [2]:
records = np.loadtxt('t20s_json/README.txt', delimiter='-', skiprows=24, 
                     dtype={'names': ('year', 'month', 'day', 'level', 'type', 'gender', 'ident', 'teams'),
                            'formats': ('f4', 'f2', 'f2', 'U14', 'U4', 'U7', 'f8', 'U32')})


In [3]:
year_counts = collections.Counter([records[i][0] for i in range(len(records))])
print(year_counts)

Counter({2022.0: 673, 2023.0: 616, 2021.0: 398, 2019.0: 357, 2024.0: 305, 2018.0: 194, 2016.0: 133, 2020.0: 124, 2012.0: 109, 2014.0: 102, 2017.0: 73, 2015.0: 72, 2013.0: 69, 2010.0: 65, 2009.0: 52, 2007.0: 36, 2011.0: 23, 2008.0: 15, 2006.0: 8, 2005.0: 3})


In [4]:
print('2005-2022:', sum([year_counts[year] for year in range(2005,2023)]), sum([year_counts[year] for year in range(2005,2023)])/len(records))
print('2023:', sum([year_counts[year] for year in range(2023,2024)]), sum([year_counts[year] for year in range(2023,2024)])/len(records))
print('2024:', sum([year_counts[year] for year in range(2024,2025)]), sum([year_counts[year] for year in range(2024,2025)])/len(records))

2005-2022: 2506 0.7312518237525533
2023: 616 0.1797490516486723
2024: 305 0.08899912459877443


Splitting on whole years, this gives proportions of roughly 73-18-9, the closest available to 70-20-10.
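The choice of boundary years can be checked with a small search over every possible pair of cut points. The sketch below copies the yearly counts from the `Counter` output above; the helper names and the distance measure (sum of absolute differences from 70-20-10) are assumptions made for illustration.

```python
from itertools import combinations

# Matches per year, copied from the Counter output above
year_counts = {2005: 3, 2006: 8, 2007: 36, 2008: 15, 2009: 52, 2010: 65,
               2011: 23, 2012: 109, 2013: 69, 2014: 102, 2015: 72, 2016: 133,
               2017: 73, 2018: 194, 2019: 357, 2020: 124, 2021: 398,
               2022: 673, 2023: 616, 2024: 305}

def split_proportions(counts, valid_start, test_start):
    """Proportions of matches before valid_start, up to test_start, and after."""
    total = sum(counts.values())
    train = sum(n for y, n in counts.items() if y < valid_start)
    valid = sum(n for y, n in counts.items() if valid_start <= y < test_start)
    test = sum(n for y, n in counts.items() if y >= test_start)
    return train / total, valid / total, test / total

def closest_split(counts, goal=(0.7, 0.2, 0.1)):
    """Return the (valid_start, test_start) pair closest to the goal proportions."""
    years = sorted(counts)

    def distance(cut):
        return sum(abs(p - g) for p, g in zip(split_proportions(counts, *cut), goal))

    # Validation must start after the first year, and test later still
    return min(combinations(years[1:], 2), key=distance)

best_cut = closest_split(year_counts)
print(best_cut, split_proportions(year_counts, *best_cut))
```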

In [5]:
train_idents = [str(int(records[i]['ident'])) for i in range(len(records)) if records[i]['year'] in range(2005,2023)]
valid_idents = [str(int(records[i]['ident'])) for i in range(len(records)) if records[i]['year'] == 2023]
test_idents = [str(int(records[i]['ident'])) for i in range(len(records)) if records[i]['year'] == 2024]

<a id='explore'></a>
## Data exploration
[Return to contents](#contents)
### Examine a single match:
The first stage is to explore a JSON covering a single match, to understand what data is available and how it might be used further.

In [6]:
with open('t20s_json/211028.json') as json_data: # Randomly selected match - England vs Australia in 2005
    match1 = json.load(json_data)

In [7]:
for key in match1:
    print('*****')
    print(key)
    for key2 in match1[key]:
        print('*',key2)

*****
meta
* data_version
* created
* revision
*****
info
* balls_per_over
* city
* dates
* gender
* match_type
* match_type_number
* officials
* outcome
* overs
* player_of_match
* players
* registry
* season
* team_type
* teams
* toss
* venue
*****
innings
* {'team': 'England', 'overs': [{'over': 0, 'deliveries': [{'batter': 'ME Trescothick', 'bowler': 'B Lee', 'non_striker': 'GO Jones', 'runs': {'batter': 0, 'extras': 0, 'total': 0}}, {'batter': 'ME Trescothick', 'bowler': 'B Lee', 'non_striker': 'GO Jones', 'runs': {'batter': 1, 'extras': 0, 'total': 1}}, {'batter': 'GO Jones', 'bowler': 'B Lee', 'non_striker': 'ME Trescothick', 'runs': {'batter': 0, 'extras': 0, 'total': 0}}, {'batter': 'GO Jones', 'bowler': 'B Lee', 'non_striker': 'ME Trescothick', 'runs': {'batter': 0, 'extras': 0, 'total': 0}}, {'batter': 'GO Jones', 'bowler': 'B Lee', 'non_striker': 'ME Trescothick', 'runs': {'batter': 0, 'extras': 0, 'total': 0}}, {'batter': 'GO Jones', 'bowler': 'B Lee', 'extras': {'noballs'

So the file contains the following information:
* meta - with information about the save file version
* info - for general information about the match
* innings - information about the actual events of the match, although this data seems to be stored in a more complex way

In [8]:
match1['info']

{'balls_per_over': 6,
 'city': 'Southampton',
 'dates': ['2005-06-13'],
 'gender': 'male',
 'match_type': 'T20',
 'match_type_number': 2,
 'officials': {'match_referees': ['JJ Crowe'],
  'tv_umpires': ['MR Benson'],
  'umpires': ['NJ Llong', 'JW Lloyds']},
 'outcome': {'by': {'runs': 100}, 'winner': 'England'},
 'overs': 20,
 'player_of_match': ['KP Pietersen'],
 'players': {'Australia': ['AC Gilchrist',
   'ML Hayden',
   'A Symonds',
   'MJ Clarke',
   'MEK Hussey',
   'RT Ponting',
   'DR Martyn',
   'B Lee',
   'JN Gillespie',
   'MS Kasprowicz',
   'GD McGrath'],
  'England': ['ME Trescothick',
   'GO Jones',
   'A Flintoff',
   'KP Pietersen',
   'MP Vaughan',
   'PD Collingwood',
   'AJ Strauss',
   'VS Solanki',
   'J Lewis',
   'D Gough',
   'SJ Harmison']},
 'registry': {'people': {'A Flintoff': 'ddc0828d',
   'A Symonds': 'bd77eb62',
   'AC Gilchrist': '2b6e6dec',
   'AJ Strauss': 'b68d14a9',
   'B Lee': 'dd09ff8e',
   'D Gough': 'fcbf5a30',
   'DR Martyn': '69762509',
   'G

In [9]:
innings = match1['innings']
innings

[{'team': 'England',
  'overs': [{'over': 0,
    'deliveries': [{'batter': 'ME Trescothick',
      'bowler': 'B Lee',
      'non_striker': 'GO Jones',
      'runs': {'batter': 0, 'extras': 0, 'total': 0}},
     {'batter': 'ME Trescothick',
      'bowler': 'B Lee',
      'non_striker': 'GO Jones',
      'runs': {'batter': 1, 'extras': 0, 'total': 1}},
     {'batter': 'GO Jones',
      'bowler': 'B Lee',
      'non_striker': 'ME Trescothick',
      'runs': {'batter': 0, 'extras': 0, 'total': 0}},
     {'batter': 'GO Jones',
      'bowler': 'B Lee',
      'non_striker': 'ME Trescothick',
      'runs': {'batter': 0, 'extras': 0, 'total': 0}},
     {'batter': 'GO Jones',
      'bowler': 'B Lee',
      'non_striker': 'ME Trescothick',
      'runs': {'batter': 0, 'extras': 0, 'total': 0}},
     {'batter': 'GO Jones',
      'bowler': 'B Lee',
      'extras': {'noballs': 1},
      'non_striker': 'ME Trescothick',
      'runs': {'batter': 0, 'extras': 1, 'total': 1}},
     {'batter': 'GO Jones',

The innings data is deeply nested, with many layers of dictionaries and lists. Pulling out information takes some work, but the data is stored very systematically.
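Because the nesting is systematic (innings, then overs, then deliveries), it can be walked with three loops. The following is a minimal sketch that flattens the structure into one row per delivery; `iter_deliveries` is a hypothetical helper, and the toy match only reuses the first two deliveries from the output above.

```python
def iter_deliveries(match_data):
    """Yield one flat dict per delivery from the nested innings structure."""
    for innings_number, innings in enumerate(match_data['innings'], start=1):
        for over in innings['overs']:
            # 'ball' counts every delivery in the over, including extras
            for ball_index, delivery in enumerate(over['deliveries'], start=1):
                yield {
                    'innings': innings_number,
                    'team': innings['team'],
                    'over': over['over'],
                    'ball': ball_index,
                    'batter': delivery['batter'],
                    'bowler': delivery['bowler'],
                    'runs': delivery['runs']['total'],
                }

# Toy match with the same layout as the real file
toy_match = {'innings': [{'team': 'England', 'overs': [{'over': 0, 'deliveries': [
    {'batter': 'ME Trescothick', 'bowler': 'B Lee', 'non_striker': 'GO Jones',
     'runs': {'batter': 0, 'extras': 0, 'total': 0}},
    {'batter': 'ME Trescothick', 'bowler': 'B Lee', 'non_striker': 'GO Jones',
     'runs': {'batter': 1, 'extras': 0, 'total': 1}},
]}]}]}

rows = list(iter_deliveries(toy_match))
print(rows[1])
```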

### Functions for feature generation

Blindly feeding all of this data into a model would create something with a huge number of parameters that would be difficult to evaluate and explore. Instead, I will write some functions that generate specific features.

#### Cumulative wickets
It is generally accepted that losing wickets affects many other factors, including the scoring rate. The pattern of wickets in the first innings might therefore give some indication of the result.

In [10]:
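def cumul_wickets(match_data, innings):
    """Return the cumulative wickets lost at the end of each over of an innings.

    `innings` is 1-indexed. Keys are the over numbers '0' to '19' as strings.
    """
    output = {}
    count = 0
    max_over = 0
    for over in match_data['innings'][innings-1]['overs']:
        for delivery in over['deliveries']:
            # A delivery with a 'wickets' entry is counted as one wicket
            if 'wickets' in delivery:
                count += 1
        output[str(over['over'])] = count
        max_over = over['over']
    # Innings that ended early: pad the remaining overs with 0.
    # Note: the range starts at max_over, so the last over bowled is also reset to 0.
    if max_over < 19:
        for extra_over in range(max_over,20):
            output[str(extra_over)] = 0
    return output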

In [11]:
cumul_wickets(match1, 1)

{'0': 0,
 '1': 0,
 '2': 0,
 '3': 1,
 '4': 1,
 '5': 2,
 '6': 2,
 '7': 2,
 '8': 2,
 '9': 2,
 '10': 3,
 '11': 4,
 '12': 4,
 '13': 5,
 '14': 5,
 '15': 5,
 '16': 5,
 '17': 5,
 '18': 6,
 '19': 8}
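`cumul_wickets` fills overs that were never bowled with 0. One possible variant, shown here only as a sketch, carries the final wicket count forward instead, so that an innings ending early still shows how many wickets fell. It also counts every dismissal on a delivery, assuming `'wickets'` holds a list with one entry per dismissal. The function name, the `total_overs` parameter and the toy data are hypothetical.

```python
def cumul_wickets_carry(match_data, innings, total_overs=20):
    """Cumulative wickets per over, carrying the final count into unplayed overs."""
    output = {}
    count = 0
    for over in match_data['innings'][innings - 1]['overs']:
        for delivery in over['deliveries']:
            # Assumes 'wickets' is a list with one entry per dismissal
            count += len(delivery.get('wickets', []))
        output[str(over['over'])] = count
    # Any over without an entry gets the final count
    for over_number in range(total_overs):
        output.setdefault(str(over_number), count)
    return output

# Toy innings that ends after two overs (player names are made up)
toy_match = {'innings': [{'team': 'A', 'overs': [
    {'over': 0, 'deliveries': [
        {'runs': {'total': 1}},
        {'runs': {'total': 0}, 'wickets': [{'kind': 'bowled', 'player_out': 'X'}]}]},
    {'over': 1, 'deliveries': [
        {'runs': {'total': 0}, 'wickets': [{'kind': 'run out', 'player_out': 'Y'},
                                           {'kind': 'run out', 'player_out': 'Z'}]}]},
]}]}

print(cumul_wickets_carry(toy_match, 1, total_overs=4))
```

Switching to this variant would change the feature values, so the models would need to be retrained and the results re-checked.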

#### Target set
The data for the second innings includes the target: the number of runs the chasing team needs to win (normally the first-innings total plus one) and the number of overs available to chase it. This seems very likely to be helpful in predicting the result.

In [12]:
def target_info(match_data):
    try:
        return match_data['innings'][1]['target']
    except (IndexError, KeyError):
        return {'overs': 0, 'runs': 0}

In [13]:
target_info(match1)

{'overs': 20, 'runs': 180}
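The two target values can also be combined into a single derived quantity, the required run rate. This is a sketch of a possible extra feature that the models in this notebook do not use; it assumes `'overs'` is a plain number of overs, and the function name is hypothetical.

```python
def required_run_rate(target):
    """Runs per over the chasing side needs, or 0.0 when no target is recorded."""
    if target['overs'] == 0:
        return 0.0
    return target['runs'] / target['overs']

print(required_run_rate({'overs': 20, 'runs': 180}))
```

The zero check matches the fallback value that `target_info` returns when a match has no second innings.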

#### General info
Gathering some general information about the location, date and teams.

In [14]:
def match_info(match_data):
    try:
        city = match_data['info']['city']
    except KeyError:
        city = ''
    return {'city': city,
            'gender': match_data['info']['gender'],
           'month': int(match_data['info']['dates'][0][5:7]),
           'first_team': match_data['innings'][0]['team'],
           'second_team': [i for i in match_data['info']['teams'] if i!=match_data['innings'][0]['team']][0],
           'venue': match_data['info']['venue']}

In [15]:
match_info(match1)

{'city': 'Southampton',
 'gender': 'male',
 'month': 6,
 'first_team': 'England',
 'second_team': 'Australia',
 'venue': 'The Rose Bowl'}

<a id='class'></a>
## Classification
[Return to contents](#contents)

The next step is to test some models on classifying the result of the match using the generated features. A wide range of methods could be used; three are implemented here:
- [Decision tree](#dt)
- [Random Forest](#rf)
- [SVM](#svm)

### Data preparation
The first stage is to compile the data in a suitable format.

In [16]:
def get_winner(match_data):
    """Get correct label"""
    if 'winner' in match_data['info']['outcome']:
        winner = match_data['info']['outcome']['winner']
        if winner == match_data['innings'][0]['team']:
            return 1 # Team batting first won
        else:
            return 2 # Team batting second won
    else:
        return 0 # No winner

def load_match(match_id):
    """Function to load in and organise the data for a single match"""
    # Load data
    with open(f't20s_json/{match_id}.json') as json_data:
        match_data = json.load(json_data)

    # Get features
    features = {**cumul_wickets(match_data, 1), **target_info(match_data), **match_info(match_data)}

    # Get label
    features['target'] = get_winner(match_data)

    return features

def data_load(id_list):
    """Load all data for the match ids in the list"""
    features = []
    for match_id in id_list:
        new_row = load_match(match_id)
        features.append(new_row)
    output = pd.DataFrame(features)
    return output.drop(columns='target'), output['target']

In [17]:
train_data, train_labels = data_load(train_idents)
valid_data, valid_labels = data_load(valid_idents)
test_data, test_labels = data_load(test_idents)

In [18]:
train_data

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | overs | runs | city | gender | month | first_team | second_team | venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 4 | 4 | 5 | 5 | 5 | 5 | 6 | ... | 9 | 10 | 20.0 | 109 | Kigali City | male | 12 | Uganda | Rwanda | Gahanga International Cricket Stadium, Rwanda |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | ... | 0 | 0 | 18.0 | 165 | Bangi | male | 12 | Singapore | Qatar | UKM-YSD Cricket Oval, Bangi |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 3 | ... | 8 | 9 | 20.0 | 154 | Bangi | male | 12 | Malaysia | Bahrain | UKM-YSD Cricket Oval, Bangi |
| 3 | 0 | 1 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | ... | 0 | 0 | 20.0 | 44 | Bridgetown | female | 12 | West Indies | England | Kensington Oval, Bridgetown, Barbados |
| 4 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | ... | 8 | 9 | 20.0 | 137 | Kigali City | male | 12 | Tanzania | Rwanda | Gahanga International Cricket Stadium, Rwanda |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2501 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 4 | ... | 6 | 7 | 20.0 | 127 | Auckland | male | 2 | West Indies | New Zealand | Eden Park |
| 2502 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 2 | 3 | 20.0 | 210 | Brisbane | male | 1 | Australia | South Africa | Brisbane Cricket Ground, Woolloongabba |
| 2503 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 9 | 10 | 20.0 | 134 | Johannesburg | male | 10 | South Africa | New Zealand | New Wanderers Stadium |
| 2504 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | ... | 6 | 8 | 20.0 | 180 | Southampton | male | 6 | England | Australia | The Rose Bowl |


<a id='dt'></a>
### Decision tree
[Return to classification](#class)

Starting off with a simple model.

#### Finalise data

There is some final formatting of the data to do, including one-hot encoding of the gender. For now, the venue, city and teams are removed since they have too many unique values.

In [19]:
drop_cols = ['city', 'gender', 'first_team', 'second_team', 'venue']

# Fit the encoder on the training data only, then reuse it for every split
encoder = preprocessing.OneHotEncoder()
encoder.fit(train_data[['gender']])

def encode_gender(data):
    """Drop the categorical columns and append the one-hot encoded gender."""
    encoded = pd.DataFrame(encoder.transform(data[['gender']]).toarray(),
                           columns=encoder.get_feature_names_out(['gender']))
    return data.drop(columns=drop_cols).join(encoded)

train_data_small = encode_gender(train_data)
valid_data_small = encode_gender(valid_data)
test_data_small = encode_gender(test_data)
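One-hot encoding replaces a categorical column with one 0/1 column per category. The pure-Python sketch below illustrates the idea and is not the scikit-learn implementation: the categories are learned from the training values only, so every split ends up with the same columns. The function names are hypothetical, and `'mixed'` is a made-up value used to show what happens to an unseen category.

```python
def fit_categories(values):
    """Learn the sorted list of categories from the training values."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen values become all zeros."""
    return [[1.0 if value == category else 0.0 for category in categories]
            for value in values]

categories = fit_categories(['male', 'female', 'male'])  # training split only
train_rows = one_hot(['male', 'female', 'male'], categories)
valid_rows = one_hot(['female', 'mixed'], categories)    # 'mixed' was never seen

print(categories)
print(valid_rows)
```

By default, scikit-learn's `OneHotEncoder` raises an error on an unseen category rather than encoding it as zeros, so this sketch differs on that point.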


In [20]:
train_data_small

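| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | overs | runs | month | gender_female | gender_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 4 | 4 | 5 | 5 | 5 | 5 | 6 | ... | 9 | 9 | 9 | 9 | 10 | 20.0 | 109 | 12 | 0.0 | 1.0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | ... | 4 | 5 | 0 | 0 | 0 | 18.0 | 165 | 12 | 0.0 | 1.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 3 | ... | 4 | 5 | 5 | 8 | 9 | 20.0 | 154 | 12 | 0.0 | 1.0 |
| 3 | 0 | 1 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | ... | 9 | 0 | 0 | 0 | 0 | 20.0 | 44 | 12 | 1.0 | 0.0 |
| 4 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | ... | 4 | 5 | 6 | 8 | 9 | 20.0 | 137 | 12 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2501 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 4 | ... | 5 | 5 | 5 | 6 | 7 | 20.0 | 127 | 2 | 0.0 | 1.0 |
| 2502 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 2 | 2 | 2 | 2 | 3 | 20.0 | 210 | 1 | 0.0 | 1.0 |
| 2503 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 6 | 7 | 8 | 9 | 10 | 20.0 | 134 | 10 | 0.0 | 1.0 |
| 2504 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | ... | 5 | 5 | 5 | 6 | 8 | 20.0 | 180 | 6 | 0.0 | 1.0 |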


#### Build and test initial model

In [21]:
dt = DecisionTreeClassifier()
dt.fit(train_data_small, train_labels)

In [22]:
test_predict = dt.predict(test_data_small)
print(classification_report(test_labels, test_predict))
print(confusion_matrix(test_labels, test_predict))

              precision    recall  f1-score   support

           0       0.21      0.30      0.25        10
           1       0.67      0.66      0.66       154
           2       0.65      0.65      0.65       141

    accuracy                           0.64       305
   macro avg       0.51      0.53      0.52       305
weighted avg       0.65      0.64      0.64       305

[[  3   2   5]
 [  8 101  45]
 [  3  47  91]]


So this model has moderate success (about 65% precision and recall) at predicting which team wins (classes 1 and 2), but does poorly on the few matches with no winner (class 0). Perhaps this can be improved by some parameter fine-tuning.
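The figures in the classification report can be recovered directly from the confusion matrix, which helps when reading the two outputs together. In the sketch below the matrix is copied from the output above, with rows as true classes and columns as predicted classes; `precision_recall` is a hypothetical helper.

```python
# Confusion matrix copied from the output above
matrix = [[3, 2, 5],
          [8, 101, 45],
          [3, 47, 91]]

def precision_recall(matrix, cls):
    """Precision and recall for one class of a square confusion matrix."""
    true_positive = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column total
    actual = sum(matrix[cls])                    # row total (the 'support')
    return true_positive / predicted, true_positive / actual

accuracy = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))
for cls in range(3):
    precision, recall = precision_recall(matrix, cls)
    print(cls, round(precision, 2), round(recall, 2))
print('accuracy', round(accuracy, 2))
```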

#### Fine-tuning
Testing out some basic parameter fine-tuning, to see if there is much variation in results.

In [23]:
parameters = {
    'max_depth': range(5, 20, 5),
    'min_samples_leaf': range(2,22,5)
}

grid = ParameterGrid(parameters)

best_acc = 0
for paras in grid:
    print(paras)
    dt = DecisionTreeClassifier()
    dt.set_params(**paras)
    dt.fit(train_data_small, train_labels)
    test_predict = dt.predict(valid_data_small)
    acc = accuracy_score(valid_labels, test_predict)
    print(f'Accuracy: {acc}')
    if acc > best_acc:
        best_acc = acc
        best_params = paras

print('***********')
print(f'Best params: {best_params}')
print('On test data:')
dt = DecisionTreeClassifier()
dt.set_params(**best_params)
dt.fit(train_data_small, train_labels)
test_predict = dt.predict(test_data_small)
print(classification_report(test_labels, test_predict))
print(confusion_matrix(test_labels, test_predict))

{'max_depth': 5, 'min_samples_leaf': 2}
Accuracy: 0.7012987012987013
{'max_depth': 5, 'min_samples_leaf': 7}
Accuracy: 0.7012987012987013
{'max_depth': 5, 'min_samples_leaf': 12}
Accuracy: 0.711038961038961
{'max_depth': 5, 'min_samples_leaf': 17}
Accuracy: 0.7045454545454546
{'max_depth': 10, 'min_samples_leaf': 2}
Accuracy: 0.6558441558441559
{'max_depth': 10, 'min_samples_leaf': 7}
Accuracy: 0.6347402597402597
{'max_depth': 10, 'min_samples_leaf': 12}
Accuracy: 0.7142857142857143
{'max_depth': 10, 'min_samples_leaf': 17}
Accuracy: 0.6834415584415584
{'max_depth': 15, 'min_samples_leaf': 2}
Accuracy: 0.637987012987013
{'max_depth': 15, 'min_samples_leaf': 7}
Accuracy: 0.637987012987013
{'max_depth': 15, 'min_samples_leaf': 12}
Accuracy: 0.7126623376623377
{'max_depth': 15, 'min_samples_leaf': 17}
Accuracy: 0.6785714285714286
***********
Best params: {'max_depth': 10, 'min_samples_leaf': 12}
On test data:
              precision    recall  f1-score   support

           0       1.00  

Tuning gives some improvement over the default settings, but no major change. The selected model performs better on the validation data than on the test data, so the later time period may be harder to predict. With a maximum depth of 10 and at least 12 samples per leaf, the selected tree is also not particularly small.
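The order in which the combinations are printed above (keys in sorted order, the last key changing fastest) can be reproduced with `itertools.product`. This is a minimal pure-Python sketch of that enumeration, not scikit-learn's own code; `iter_grid` is a hypothetical name.

```python
from itertools import product

def iter_grid(parameters):
    """Yield one dict per combination, with keys in sorted order."""
    keys = sorted(parameters)
    for values in product(*(parameters[key] for key in keys)):
        yield dict(zip(keys, values))

parameters = {'max_depth': range(5, 20, 5), 'min_samples_leaf': range(2, 22, 5)}
combos = list(iter_grid(parameters))
print(len(combos), combos[0], combos[-1])
```

Because the tuning loop only updates the best parameters on a strictly higher accuracy, ties go to whichever combination comes first in this order.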

<a id='rf'></a>
### Random Forest
[Return to classification](#class)

This method is based on decision trees, but perhaps offers greater nuance to the final model. This time, fine-tuning is incorporated from the start.

In [24]:
parameters = {
    'max_depth': range(5, 20, 5),
    'min_samples_leaf': range(2,22,5),
    'n_estimators': range(3,9,2)
}

grid = ParameterGrid(parameters)

best_acc = 0
for paras in grid:
    print(paras)
    dt = RandomForestClassifier(random_state=0)
    dt.set_params(**paras)
    dt.fit(train_data_small, train_labels)
    test_predict = dt.predict(valid_data_small)
    acc = accuracy_score(valid_labels, test_predict)
    print(f'Accuracy: {acc}')
    if acc > best_acc:
        best_acc = acc
        best_params = paras

print('***********')
print(f'Best params: {best_params}')
print('On test data:')
dt = RandomForestClassifier(random_state=0)
dt.set_params(**best_params)
dt.fit(train_data_small, train_labels)
test_predict = dt.predict(test_data_small)
print(classification_report(test_labels, test_predict))
print(confusion_matrix(test_labels, test_predict))

{'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 3}
Accuracy: 0.7288961038961039
{'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 5}
Accuracy: 0.7353896103896104
{'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 7}
Accuracy: 0.7272727272727273
{'max_depth': 5, 'min_samples_leaf': 7, 'n_estimators': 3}
Accuracy: 0.7094155844155844
{'max_depth': 5, 'min_samples_leaf': 7, 'n_estimators': 5}
Accuracy: 0.7126623376623377
{'max_depth': 5, 'min_samples_leaf': 7, 'n_estimators': 7}
Accuracy: 0.7142857142857143
{'max_depth': 5, 'min_samples_leaf': 12, 'n_estimators': 3}
Accuracy: 0.7045454545454546
{'max_depth': 5, 'min_samples_leaf': 12, 'n_estimators': 5}
Accuracy: 0.7272727272727273
{'max_depth': 5, 'min_samples_leaf': 12, 'n_estimators': 7}
Accuracy: 0.7224025974025974
{'max_depth': 5, 'min_samples_leaf': 17, 'n_estimators': 3}
Accuracy: 0.702922077922078
{'max_depth': 5, 'min_samples_leaf': 17, 'n_estimators': 5}
Accuracy: 0.7175324675324676
{'max_depth': 5, 'min_sample

Again there is a small improvement in accuracy, and the result on the test set is slightly worse than on the validation set. It is worth noting that the hyperparameter variants generally perform better than the corresponding single decision trees. Each tree in the selected forest is at most 5 levels deep, compared with up to 10 levels for the single tree.
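A forest turns many shallow trees into one prediction by combining their outputs. scikit-learn's documentation describes its forest as averaging the trees' class probabilities rather than taking a hard vote, and the sketch below illustrates that averaging step only. The three probability rows are made up, and `average_probabilities` is a hypothetical helper.

```python
def average_probabilities(tree_probabilities):
    """Average per-class probabilities across trees; return (class, averages)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [sum(tree[c] for tree in tree_probabilities) / n_trees
                for c in range(n_classes)]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged

# Three hypothetical trees scoring one match for classes 0, 1 and 2
votes = [[0.0, 0.9, 0.1], [0.1, 0.4, 0.5], [0.0, 0.3, 0.7]]
predicted_class, averaged = average_probabilities(votes)
print(predicted_class, [round(p, 3) for p in averaged])
```

In this toy case two of the three trees lean towards class 2, but the one confident tree pulls the average towards class 1, which a simple majority vote would not do.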

<a id='svm'></a>
### SVM
[Return to classification](#class)

The same data can be used to construct a Support Vector Machine (SVM).

In [25]:
parameters = {"C": [1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1]}

grid = ParameterGrid(parameters)

best_acc = 0
for paras in grid:
    print(paras)
    model = svm.SVC()
    model.set_params(**paras)
    model.fit(train_data_small, train_labels)
    test_predict = model.predict(valid_data_small)
    acc = accuracy_score(valid_labels, test_predict)
    print(f'Accuracy: {acc}')
    if acc > best_acc:
        best_acc = acc
        best_params = paras

print('***********')
print(f'Best params: {best_params}')
print('On test data:')
model = svm.SVC()
model.set_params(**best_params)
model.fit(train_data_small, train_labels)
test_predict = model.predict(test_data_small)
print(classification_report(test_labels, test_predict))
print(confusion_matrix(test_labels, test_predict))

{'C': 1, 'gamma': 0.0001}
Accuracy: 0.7224025974025974
{'C': 1, 'gamma': 0.001}
Accuracy: 0.7418831168831169
{'C': 1, 'gamma': 0.01}
Accuracy: 0.7337662337662337
{'C': 1, 'gamma': 0.1}
Accuracy: 0.6737012987012987
{'C': 10, 'gamma': 0.0001}
Accuracy: 0.7337662337662337
{'C': 10, 'gamma': 0.001}
Accuracy: 0.7418831168831169
{'C': 10, 'gamma': 0.01}
Accuracy: 0.7175324675324676
{'C': 10, 'gamma': 0.1}
Accuracy: 0.6688311688311688
{'C': 100, 'gamma': 0.0001}
Accuracy: 0.75
{'C': 100, 'gamma': 0.001}
Accuracy: 0.7353896103896104
{'C': 100, 'gamma': 0.01}
Accuracy: 0.672077922077922
{'C': 100, 'gamma': 0.1}
Accuracy: 0.6688311688311688
***********
Best params: {'C': 100, 'gamma': 0.0001}
On test data:
              precision    recall  f1-score   support

           0       1.00      0.40      0.57        10
           1       0.76      0.74      0.75       154
           2       0.71      0.76      0.73       141

    accuracy                           0.74       305
   macro avg       0.8

The best validation accuracy (0.75) comes from C = 100 and gamma = 0.0001. The tuning results show that lower gamma values (0.0001 to 0.001) perform better than higher ones; with the RBF kernel that `svm.SVC` uses by default, a lower gamma corresponds to a smoother, less non-linear decision boundary. The effect of C is less consistent: at gamma = 0.0001 accuracy rises with C, while at the larger gamma values lower C does as well or slightly better. The overall accuracy here is slightly better than that seen with random forests.
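The role of gamma is easier to see from the RBF kernel itself, which scores the similarity of two matches as exp(-gamma × squared distance). The sketch below applies it to two hypothetical targets 30 runs apart, using the unscaled `runs` feature only; `rbf_kernel` is written out by hand for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity exp(-gamma * squared distance) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical matches whose targets differ by 30 runs
for gamma in (0.0001, 0.001, 0.01, 0.1):
    print(gamma, rbf_kernel([180], [150], gamma))
```

With gamma = 0.1 the similarity is effectively zero, so each training match only influences predictions very close to itself. On unscaled features like these, that may explain why the smaller gamma values do better.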

<a id='review'></a>
## Review
[Return to contents](#contents)

Three different models have been applied to the data, all of which are fairly simple. The best result achieved on the test set is 74% accuracy. This is better than random selection, but could perhaps be achieved with much less data.
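For context, a simple baseline is to always predict the most common class. The sketch below works this out from the test-set class counts, copied from the `support` column of the reports above.

```python
# Test-set class counts from the 'support' column
support = {0: 10, 1: 154, 2: 141}

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())
print(majority_class, round(baseline_accuracy, 3))
```

Always predicting a win for the team batting first would be right about half the time on the test set, so 74% is a clear gain over that baseline.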

### Less data
Perhaps the models don't need all the data that was provided. To test this, an SVM is built on a stripped-down version of the data that keeps only the number of runs to chase.

In [26]:
train_data_min = train_data_small[['runs']]
valid_data_min = valid_data_small[['runs']]
test_data_min = test_data_small[['runs']]

In [27]:
train_data_min

| | runs |
|---|---|
| 0 | 109 |
| 1 | 165 |
| 2 | 154 |
| 3 | 44 |
| 4 | 137 |
| ... | ... |
| 2501 | 127 |
| 2502 | 210 |
| 2503 | 134 |
| 2504 | 180 |


In [28]:
parameters = {"C": [1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1]}

grid = ParameterGrid(parameters)

best_acc = 0
for paras in grid:
    print(paras)
    model = svm.SVC()
    model.set_params(**paras)
    model.fit(train_data_min, train_labels)
    test_predict = model.predict(valid_data_min)
    acc = accuracy_score(valid_labels, test_predict)
    print(f'Accuracy: {acc}')
    if acc > best_acc:
        best_acc = acc
        best_params = paras

print('***********')
print(f'Best params: {best_params}')
print('On test data:')
model = svm.SVC()
model.set_params(**best_params)
model.fit(train_data_min, train_labels)
test_predict = model.predict(test_data_min)
print(classification_report(test_labels, test_predict))
print(confusion_matrix(test_labels, test_predict))

{'C': 1, 'gamma': 0.0001}
Accuracy: 0.6996753246753247
{'C': 1, 'gamma': 0.001}
Accuracy: 0.7077922077922078
{'C': 1, 'gamma': 0.01}
Accuracy: 0.7061688311688312
{'C': 1, 'gamma': 0.1}
Accuracy: 0.698051948051948
{'C': 10, 'gamma': 0.0001}
Accuracy: 0.7061688311688312
{'C': 10, 'gamma': 0.001}
Accuracy: 0.7077922077922078
{'C': 10, 'gamma': 0.01}
Accuracy: 0.702922077922078
{'C': 10, 'gamma': 0.1}
Accuracy: 0.6964285714285714
{'C': 100, 'gamma': 0.0001}
Accuracy: 0.7061688311688312
{'C': 100, 'gamma': 0.001}
Accuracy: 0.7077922077922078
{'C': 100, 'gamma': 0.01}
Accuracy: 0.7061688311688312
{'C': 100, 'gamma': 0.1}
Accuracy: 0.6801948051948052
***********
Best params: {'C': 1, 'gamma': 0.001}
On test data:
              precision    recall  f1-score   support

           0       1.00      0.40      0.57        10
           1       0.75      0.71      0.73       154
           2       0.69      0.76      0.72       141

    accuracy                           0.72       305
   macro avg

The performance of this model, built from just one feature per match, is very close to that of the model with 25 features (72% against 74% accuracy on the test set). Almost everything the model is able to learn for prediction seems to come from just the number of runs to be chased. Interestingly, here the performance on the test data (72%) is slightly better than on the validation set (71%).
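With a single input, the simplest possible model is one threshold on the target: predict that the team batting first wins when the target is at or above it. The sketch below fits such a threshold on made-up data. It is an illustration only, was not run on the match data, and never predicts class 0; the function names are hypothetical.

```python
def fit_threshold(runs, labels):
    """Pick the run threshold that best separates class 1 from class 2."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(runs)):
        correct = sum((1 if r >= threshold else 2) == label
                      for r, label in zip(runs, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict_threshold(runs, threshold):
    """Class 1 (batting first wins) at or above the threshold, else class 2."""
    return [1 if r >= threshold else 2 for r in runs]

# Made-up targets and results, for illustration only
toy_runs = [110, 125, 140, 155, 170, 185]
toy_labels = [2, 2, 2, 1, 1, 1]
threshold = fit_threshold(toy_runs, toy_labels)
print(threshold, predict_threshold([100, 200], threshold))
```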

### Overview
The most important information identified here is the number of runs for a team to chase. Perhaps adding in data about the teams or locations may also add some value, but the high number of unique values in different time periods makes this challenging. Rather than including variables to reflect all teams, it may be necessary to pick out some of the most common teams.
Overall, the models have reasonable performance. However, since the results can be obtained from a single input value, even such simple models seem overly complex.
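One way to pick out the most common teams is to keep the top few by frequency and group the rest under a single placeholder, which keeps the number of one-hot columns small. This is a minimal sketch of that idea: `keep_common` and the `'Other'` label are hypothetical, and the team list reuses names from the training preview with made-up frequencies.

```python
from collections import Counter

def keep_common(values, top_n, other='Other'):
    """Replace all but the top_n most frequent values with a single placeholder."""
    common = {value for value, _ in Counter(values).most_common(top_n)}
    return [value if value in common else other for value in values]

teams = ['England', 'Australia', 'England', 'Rwanda', 'Australia', 'England', 'Uganda']
reduced = keep_common(teams, top_n=2)
print(reduced)
```

To respect the time-based split, the set of common teams would need to be chosen from the training data only and then applied unchanged to the validation and test data.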