Data: Game-by-game career records and statistics for four top wide receivers.

Classification: Predict whether the team won a game from a player's receiving yards and touchdowns in that game.



In [None]:
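import pandas as pd

# Load the per-game receiving records.
df = pd.read_csv('Football.csv')

# Label each game: 1 if the Result string starts with 'W' (a win), else 0.
df['isWin'] = df['Result'].apply(lambda x: 1 if x.startswith('W') else 0)
print(df['isWin'].value_counts())

# Keep only the two features and the label, dropping rows with missing values.
df_final = df[['Yds', 'TD', 'isWin']].dropna()
df_final = df_final.rename(columns={'Yds': 'Yards', 'TD': 'Touchdowns', 'isWin': 'Is Win'})

numRows = len(df_final)

# Show how often each (Yards, Touchdowns, Is Win) combination occurs, then the full table.
print(df_final.value_counts())
print(df_final)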

isWin
1    359
0    214
Name: count, dtype: int64
Yards  Touchdowns  Is Win
0      0           1         14
                   0          8
27     0           1          8
74     0           1          5
13     0           0          4
                             ..
64     2           1          1
65     0           0          1
                   1          1
       2           1          1
206    1           1          1
Name: count, Length: 363, dtype: int64
     Yards  Touchdowns  Is Win
0        0           0       1
1       49           0       0
2       81           0       0
3       36           1       1
4       93           1       1
..     ...         ...     ...
568    121           2       1
569     31           2       1
570     70           1       1
571     26           0       1
572     44           1       0

[573 rows x 3 columns]


In [None]:
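from sklearn.model_selection import train_test_split

# Features: Yards and Touchdowns. Target: Is Win.
X = df_final[['Yards', 'Touchdowns']].values
y = df_final['Is Win'].values

# Hold out 30% of the games for testing, keeping the win/loss ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)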



[[105   0]
 [  0   0]
 [ 34   0]
 [ 90   1]
 [ 59   1]
 [179   1]
 [ 24   0]
 [ 34   0]
 [ 44   1]
 [ 44   1]
 [ 83   0]
 [ 88   0]
 [  9   0]
 [ 28   1]
 [135   1]
 [126   0]
 [ 55   0]
 [156   0]
 [ 22   0]
 [109   1]
 [ 42   0]
 [ 89   1]
 [ 43   0]
 [ 60   0]
 [ 59   0]
 [  7   0]
 [  0   0]
 [136   2]
 [ 45   0]
 [162   1]
 [115   2]
 [ 11   0]
 [ 33   0]
 [144   2]
 [ 98   0]
 [ 62   0]
 [110   0]
 [  0   0]
 [ 94   2]
 [106   1]
 [101   1]
 [ 40   0]
 [ 84   1]
 [ 32   1]
 [ 67   0]
 [167   0]
 [ 47   0]
 [  3   0]
 [ 84   0]
 [ 74   0]
 [ 74   0]
 [ 75   1]
 [  8   0]
 [ 49   0]
 [ 88   2]
 [ 42   0]
 [ 14   1]
 [180   0]
 [ 50   1]
 [141   1]
 [ 67   2]
 [116   1]
 [ 69   1]
 [ 54   0]
 [ 82   1]
 [ 31   0]
 [ 54   1]
 [112   0]
 [ 87   0]
 [ 73   0]
 [ 84   2]
 [  0   0]
 [ 46   0]
 [  6   0]
 [ 99   0]
 [ 42   0]
 [ 17   0]
 [ 94   2]
 [206   1]
 [111   1]
 [ 44   0]
 [ 69   1]
 [ 81   1]
 [ 26   0]
 [ 46   1]
 [ 21   1]
 [  0   0]
 [106   0]
 [ 53   3]
 [ 85   0]
 [ 23   0]

The first pipeline standardizes features before applying logistic regression.
The second pipeline applies a decision tree classifier directly.
The third pipeline standardizes features before using a support vector machine (SVM) classifier with a linear kernel.
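
As a rough illustration of what the standardization step in the first and third pipelines does, the sketch below scales a few rows by hand and compares the result with `StandardScaler`. The sample (yards, touchdowns) values and the names `X_demo`, `manual`, and `scaled` are made up for this example and are not taken from Football.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (yards, touchdowns) rows, made up for illustration only.
X_demo = np.array([[105, 0], [0, 0], [34, 0], [90, 1], [59, 1]], dtype=float)

# Standardize by hand: subtract each column's mean, divide by its standard deviation.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler learns the same per-column mean and standard deviation in fit().
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))   # do the two results agree?
print(scaled.mean(axis=0).round(6))  # per-column mean after scaling
print(scaled.std(axis=0).round(6))   # per-column standard deviation after scaling
```

Putting the scaler inside a pipeline means it is fit only on the training portion of each cross-validation fold, so the held-out fold does not influence the scaling.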

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe1 = make_pipeline(StandardScaler(), LogisticRegression(tol=.001, random_state=1))

pipe2 = make_pipeline(DecisionTreeClassifier(max_depth=2,
                                             criterion='entropy',
                                             random_state=1))

pipe3 = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.001, random_state=1))

from sklearn.ensemble import VotingClassifier

mv_clf = VotingClassifier(estimators=[('lr', pipe1), ('dt', pipe2), ('svm', pipe3)])

all_clf = [pipe1, pipe2, pipe3, mv_clf]

clf_labels = ['LogisticRegression', 'Decision tree', 'SVM', 'Majority voting']

print('10-fold cross validation:\n')
for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 0.63 Stdev: 0.003 [LogisticRegression]
Accuracy: 0.61 Stdev: 0.038 [Decision tree]
Accuracy: 0.63 Stdev: 0.003 [SVM]
Accuracy: 0.63 Stdev: 0.003 [Majority voting]


The 10-fold cross-validation results show that the logistic regression and SVM pipelines both reached a mean accuracy of 0.63 with very little variation across folds (standard deviation 0.003), while the decision tree pipeline scored slightly lower at 0.61 with more variation (standard deviation 0.038).
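
One way to put the 0.63 figure in context is to compare it with a classifier that always predicts the more common class. The sketch below uses only the class counts printed earlier (359 wins, 214 losses); the variable names are illustrative.

```python
# Class counts taken from the isWin value counts printed earlier.
n_wins = 359
n_losses = 214

# A model that always predicts the majority class is right this often.
baseline_accuracy = max(n_wins, n_losses) / (n_wins + n_losses)

print(round(baseline_accuracy, 3))  # → 0.627
```

If a model's cross-validated accuracy sits close to this value, it may be adding little beyond always predicting a win.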

Majority Voting:

The ensemble method combines the predictions of the individual models (logistic regression, decision tree, and SVM) by taking a vote on the predicted class for each sample. In majority voting, each model casts one vote, and the class that receives the most votes becomes the final prediction. This approach can help balance out the weaknesses of individual models and often gives a more robust overall prediction. Here, the majority voting ensemble reached an accuracy of 0.63 with low variability, matching the best individual models.
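
To make the voting rule concrete, here is a minimal sketch of hard majority voting over three sets of 0/1 predictions. The prediction arrays are invented for illustration; they are not outputs of the models above.

```python
import numpy as np

# Hypothetical 0/1 predictions from three classifiers for five samples.
pred_lr = np.array([1, 1, 0, 1, 0])
pred_dt = np.array([1, 0, 0, 1, 1])
pred_svm = np.array([1, 1, 1, 0, 0])

# Stack into shape (n_classifiers, n_samples) and count the votes for class 1.
votes = np.vstack([pred_lr, pred_dt, pred_svm])
votes_for_win = votes.sum(axis=0)

# With three voters, class 1 wins when it gets at least two votes.
majority = (votes_for_win >= 2).astype(int)

print(majority)  # → [1 1 0 1 0]
```

`VotingClassifier` applies this kind of hard voting by default (`voting='hard'`), so the ensemble can only differ from an individual model on samples where that model is outvoted by the other two.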

In [None]:
pipe1.fit(X_train, y_train)

y_pred = pipe1.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe1.score(X_test, y_test))


Misclassified test set examples: 64
Out of a total of: 172
Accuracy: 0.627906976744186


In [55]:
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

In [58]:
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe2.score(X_test, y_test))

Misclassified test set examples: 61
Out of a total of: 172
Accuracy: 0.6453488372093024


In [41]:
pipe3.fit(X_train, y_train)

y_pred = pipe3.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe3.score(X_test, y_test))

Misclassified test set examples: 64
Out of a total of: 172
Accuracy: 0.627906976744186


In [37]:
mv_clf.fit(X_train, y_train)

y_pred = mv_clf.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', mv_clf.score(X_test, y_test))

Misclassified test set examples: 64
Out of a total of: 172
Accuracy: 0.627906976744186


The test-set results mostly agree with the cross-validation results: logistic regression, SVM, and majority voting each score about 0.63. The decision tree does slightly better on the test set (0.645), but the gap is only three examples out of 172 (61 misclassified versus 64), so it is most likely an artifact of this particular test split rather than a real advantage.
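
A quick back-of-the-envelope check is consistent with treating the decision tree's edge as noise. The sketch below uses the misclassification counts reported above (61 versus 64 out of 172) and a simple binomial standard error. That approximation is an assumption chosen here for illustration, not something computed in the original analysis.

```python
import math

n_test = 172
errors_tree = 61   # decision tree misclassifications on the test set
errors_other = 64  # logistic regression, SVM, and majority voting

acc_tree = 1 - errors_tree / n_test
acc_other = 1 - errors_other / n_test

# Approximate standard error of one accuracy estimate (binomial assumption).
std_err = math.sqrt(acc_other * (1 - acc_other) / n_test)

print(round(acc_tree - acc_other, 3))  # gap between the two accuracies
print(round(std_err, 3))               # sampling noise in a single estimate
```

The gap (about 0.017) is well under one standard error (about 0.037), so a test set of this size cannot reliably separate the models.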