# Introduction

This short tutorial walks through the Pandas commands that are most useful for getting insight into a dataset.

In [2]:
import pandas as pd
from src.utils import load_data_from_google_drive

# Loading in Data

The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use.

We're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe** variable. The dataframe is the golden jewel data structure for Pandas. It is defined as "a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)".

Just think of it as a table for now. 
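
As a minimal sketch of what **pd.read_csv()** does, the snippet below reads a tiny in-memory CSV instead of a file on disk. The three rows are a hypothetical sample in the same layout as the dataset used here, and `io.StringIO` simply stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same column layout as the tutorial's dataset.
csv_text = """Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
1985,20,1228,81,1328,64,N,0
1985,25,1106,77,1354,70,H,0
1985,25,1112,63,1223,56,H,0
"""

# read_csv accepts a file path or any file-like object; StringIO plays the role of the file.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)
```

With a real file you would pass its path, for example `pd.read_csv('games.csv')`, where the file name is only a placeholder.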

In [3]:
df = load_data_from_google_drive(url='https://drive.google.com/file/d/184JcLbSpArA_uq0DgAv2k892KChJVPHt/view?usp=share_link')

In [4]:
df

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
...,...,...,...,...,...,...,...,...
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0


# The Basics

Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first couple rows of the dataframe (or the function **tail()** to see the last few rows).

In [5]:
df.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [6]:
df.tail()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0
145288,2016,132,1386,87,1433,74,N,0


We can see the dimensions of the dataframe using the **shape** attribute:

In [7]:
df.shape

(145289, 8)

We can also extract all the column names as a list using the **columns** attribute, and the row labels using the **index** attribute.

In [8]:
df.columns.tolist()

['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']

In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see statistics like mean, min, etc. for each numeric column of the dataset.

In [9]:
df.describe()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Numot
count,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0
mean,2001.574834,75.223816,1286.720646,76.600321,1282.864064,64.497009,0.044387
std,9.233342,33.287418,104.570275,12.173033,104.829234,11.380625,0.247819
min,1985.0,0.0,1101.0,34.0,1101.0,20.0,0.0
25%,1994.0,47.0,1198.0,68.0,1191.0,57.0,0.0
50%,2002.0,78.0,1284.0,76.0,1280.0,64.0,0.0
75%,2010.0,103.0,1379.0,84.0,1375.0,72.0,0.0
max,2016.0,132.0,1464.0,186.0,1464.0,150.0,6.0


Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function **max()** will show you the maximum values of all columns.

In [10]:
df.max()

Season    2016
Daynum     132
Wteam     1464
Wscore     186
Lteam     1464
Lscore     150
Wloc         N
Numot        6
dtype: object

Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column using the bracket indexing operator

In [11]:
df['Wscore'].max()

186

Similarly, if you'd like to find the mean of the losing teams' scores, call **mean()** on that column.

In [12]:
df['Lscore'].mean()

64.49700940883343

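But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to find the integer position of that row. Because this dataframe uses the default 0, 1, 2, ... index, the position is also the row label.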

In [13]:
df['Wscore'].argmax()

24970
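
The distinction between position and label only shows up when the index is not the default one. The sketch below uses a hypothetical Series with made-up labels and assumes a recent pandas version, where **argmax()** returns a position and **idxmax()** returns a label.

```python
import pandas as pd

# Hypothetical scores whose row labels are not 0, 1, 2.
scores = pd.Series([70, 186, 81], index=[10, 20, 30], name="Wscore")

position = scores.argmax()  # integer position of the maximum
label = scores.idxmax()     # index label of the maximum

print(position, label)
print(scores.iloc[position], scores.loc[label])
```

Pair `argmax()` with `iloc` and `idxmax()` with `loc`, and the lookup stays correct whatever the index looks like.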

One of the most useful functions that you can call on certain columns in a dataframe is the **value_counts()** function. It shows how many times each item appears in the column. This particular command shows the number of games in each season

In [14]:
df['Season'].value_counts()

2016    5369
2014    5362
2015    5354
2013    5320
2010    5263
2012    5253
2009    5249
2011    5246
2008    5163
2007    5043
2006    4757
2005    4675
2003    4616
2004    4571
2002    4555
2000    4519
2001    4467
1999    4222
1998    4167
1997    4155
1992    4127
1991    4123
1996    4122
1995    4077
1994    4060
1990    4045
1989    4037
1993    3982
1988    3955
1987    3915
1986    3783
1985    3737
Name: Season, dtype: int64
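
By default **value_counts()** orders its result by frequency. The sketch below, on a small hypothetical Series, shows two common variations: sorting the result by its index to get chronological order, and asking for fractions instead of raw counts.

```python
import pandas as pd

# Hypothetical season column with six games.
seasons = pd.Series([1985, 1986, 1985, 1987, 1986, 1985], name="Season")

by_count = seasons.value_counts()                 # most frequent season first
by_season = seasons.value_counts().sort_index()   # ordered by season instead
shares = seasons.value_counts(normalize=True)     # fraction of all games per season

print(by_season)
print(shares)
```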

**Q**: How many unique seasons are there in the dataset? Use the nunique() function.

In [15]:
# Write your code here
NumberOfSeasons=df['Season'].nunique()

print('Number of unique seasons in the dataset are:', NumberOfSeasons)

print()
print('There are 32 different seasons represented in the dataset. \nEach season can occur a different number of times in the dataset, but in total, there are 32 unique seasons represented.')

Number of unique seasons in the dataset are: 32

There are 32 different seasons represented in the dataset. 
Each season can occur a different number of times in the dataset, but in total, there are 32 unique seasons represented.


**Q**: Find the team with the most wins. Use the value_counts() function on the Wteam column.

In [16]:
# Write your code here
TeamWithMostWins=df['Wteam'].value_counts()[:1]
print('Team with the most wins:')
print()
print("TeamID, Wins:", TeamWithMostWins)


Team with the most wins:

TeamID, Wins: 1181    819
Name: Wteam, dtype: int64


In [17]:
#The output pairs a team ID with its number of wins.
#Here the team ID is 1181 and the number of wins is 819.
#value_counts() sorts by count in descending order, so the first entry is the team with the most wins.

# Accessing Values

Then, in order to get attributes about the game, we need to use the **iloc[]** indexer. iloc is definitely one of the more important tools in Pandas. The main idea is that you use it whenever you have the integer position of a row that you want to access. As per the Pandas documentation, iloc is "integer-location based indexing for selection by position."

In [18]:
df.iloc[[df['Wscore'].argmax()]]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
24970,1991,68,1258,186,1109,140,H,0


Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. 

In [19]:
df.iloc[[df['Wscore'].argmax()]]['Lscore']

24970    140
Name: Lscore, dtype: int64

When you see data displayed in the above format, you're dealing with a Pandas **Series** object, not a dataframe object.

In [20]:
type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])

pandas.core.series.Series

In [21]:
type(df.iloc[[df['Wscore'].argmax()]])

pandas.core.frame.DataFrame

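The following is a summary of the three data structures Pandas originally offered. Note that Panel has since been deprecated and removed from Pandas, so Series and DataFrame are the two you will use in practice.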

![](DataStructures.png)

When you want to access values in a Series, you'll want to just treat the Series like a Python dictionary, so you'd access the value according to its key (which is normally an integer index)

In [22]:
df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]

140
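
To make the dictionary analogy concrete, here is a small sketch with a hypothetical two-element Series. The labels play the role of keys, and position-based access is still available through `iloc`.

```python
import pandas as pd

# Hypothetical Series: row labels are the keys, losing scores are the values.
lscore = pd.Series([140, 64], index=[24970, 0], name="Lscore")

by_key = lscore[24970]          # lookup by label, like a dictionary key
by_label = lscore.loc[24970]    # the same lookup, written explicitly
by_position = lscore.iloc[0]    # first element, regardless of its label
has_key = 24970 in lscore       # membership is tested against the labels

print(by_key, by_label, by_position, has_key)
print(lscore.to_dict())
```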

The other really important indexer in Pandas is **loc**. Unlike iloc, which indexes by integer position, loc is a "Purely label-location based indexer for selection by label". Since all the games are labeled from 0 to 145288 in order, iloc and loc are going to be pretty interchangeable in this type of dataset.

In [23]:
df.iloc[:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0


In [24]:
df.loc[:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0


Notice the slight difference in that iloc is exclusive of the second number, while loc is inclusive. 
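
The same slicing behaviour is easier to see when the row labels are strings. This is a small sketch with hypothetical labels `g0` to `g3`, not part of the dataset.

```python
import pandas as pd

# Hypothetical games labeled with strings instead of integers.
games = pd.DataFrame(
    {"Wscore": [81, 77, 63, 70], "Lscore": [64, 70, 56, 54]},
    index=["g0", "g1", "g2", "g3"],
)

by_position = games.iloc[:3]   # positions 0, 1 and 2; the end position is excluded
by_label = games.loc[:"g3"]    # every label up to and including "g3"

print(len(by_position), len(by_label))
```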

Below is an example of how you can use loc to achieve the same task as we did previously with iloc.

In [25]:
df.loc[df['Wscore'].argmax(), 'Lscore']

140

A faster alternative is the **at[]** accessor. It is really useful whenever you know both the row label and the column label of the single value that you want to get.

In [26]:
df.at[df['Wscore'].argmax(), 'Lscore']

140

If you'd like to see more discussion on how loc and iloc are different, check out this great Stack Overflow post: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation. Just remember that **iloc looks at position** and **loc looks at labels**. Loc becomes very important when your row labels aren't integers. 
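
Position and label also drift apart as soon as a dataframe is reordered. The sketch below sorts a small hypothetical dataframe and then reads values with `iloc`, `loc` and `at`.

```python
import pandas as pd

# Hypothetical three-game dataframe with the default 0, 1, 2 index.
games = pd.DataFrame({"Wscore": [81, 186, 63], "Lscore": [64, 140, 56]})

# Sorting keeps the original labels, so the index order becomes [1, 0, 2].
ranked = games.sort_values("Wscore", ascending=False)

first_by_position = ranked.iloc[0]["Wscore"]   # top row after sorting
row_labeled_zero = ranked.loc[0, "Wscore"]     # the row that was originally first
fast_scalar = ranked.at[1, "Lscore"]           # single value by row and column label

print(first_by_position, row_labeled_zero, fast_scalar)
```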

# Sorting

Let's say that we want to sort the dataframe in increasing order for the scores of the losing team

In [27]:
df.sort_values('Lscore').head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
100027,2008,66,1203,49,1387,20,H,0
49310,1997,66,1157,61,1204,21,H,0
89021,2006,44,1284,41,1343,21,A,0
85042,2005,66,1131,73,1216,22,H,0
103660,2009,26,1326,59,1359,22,H,0


**Q**: Make three dataframes that are sorted by season, winning team, and winning score respectively. Then, using iloc, select the rows from index 100 to 200 and the columns for season, winning team, and winning score, respectively.

In [28]:
#write your code here

#1. The first dataframe sorted by season.
df_sorted_by_season = df.sort_values('Season')
print("Dataframe sorted by season:")
print(df_sorted_by_season.head(2))

print()

#2. The second dataframe should be sorted by winning team.
df_sorted_by_wteam = df.sort_values('Wteam')
print("Dataframe sorted by winning team:")
print(df_sorted_by_wteam.head(2))

print ()

#3. The third dataframe should be sorted by winning score.
df_sorted_by_wscore = df.sort_values('Wscore')
print("Dataframe sorted by winning score:")
print(df_sorted_by_wscore.head(2))

print()


Dataframe sorted by season:
      Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0       1985      20   1228      81   1328      64    N      0
2484    1985      99   1275      73   1132      63    H      0

Dataframe sorted by winning team:
        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
140642    2016      25   1101      72   1197      62    N      0
136740    2015      62   1101      87   1146      70    H      0

Dataframe sorted by winning score:
        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
128582    2013     116   1264      34   1193      31    H      0
69336     2001     126   1206      35   1423      33    N      0



In [29]:
#iloc excludes the end of a slice, so [100:201] selects positions 100 through 200 (inclusive).
#The cells below select those rows and the columns 'Season', 'Wteam', and 'Wscore' from each of the sorted dataframes above.

In [30]:

subset_season = df_sorted_by_season[['Season', 'Wteam', 'Wscore']].iloc[100:201]



In [31]:
subset_wteam = df_sorted_by_wteam[['Season', 'Wteam', 'Wscore']].iloc[100:201]


In [32]:
subset_wscore = df_sorted_by_wscore[['Season', 'Wteam', 'Wscore']].iloc[100:201]


**Q**: From these three subsets you obtained above, find the season and winning team for the game with the highest winning score.

In [33]:
#write your code here

print('The season and winning team for the game with the highest winning score from the different subsets - from index [100:200]:')
print()
#1. 
# I am sorting the subset_season dataFrame by winning score in descending order
subset_season_sorted = subset_season.sort_values('Wscore', ascending=False)

# I am selecting the first row which is the highest winning score
highest_score_game = subset_season_sorted.iloc[0]

print("For subset_season:")
print("Season of the game with the highest winning score:", highest_score_game['Season'])
print("Winning team of the game with the highest winning score:", highest_score_game['Wteam'])

print()


#2. 
# I am sorting the subset_wteam dataFrame by winning score in descending order
subset_wteam_sorted = subset_wteam.sort_values('Wscore', ascending=False)

# I am selecting the first row which is the highest winning score
highest_score_game = subset_wteam_sorted.iloc[0]

print("For subset_wteam:")
print("Season of the game with the highest winning score:", highest_score_game['Season'])
print("Winning team of the game with the highest winning score:", highest_score_game['Wteam'])

print()

#3. 
# I am sorting the subset_wscore dataFrame by winning score in descending order
subset_wscore_sorted = subset_wscore.sort_values('Wscore', ascending=False)

# I am selecting the first row which is the highest winning score
highest_score_game = subset_wscore_sorted.iloc[0]

print("For subset_wscore:")
print("Season of the game with the highest winning score:", highest_score_game['Season'])
print("Winning team of the game with the highest winning score:", highest_score_game['Wteam'])


# This gives me three different results because I created three different dataframes:
# one sorted by season, one sorted by winning team, and one sorted by winning score.
# They are all ordered differently, so the positions [100:201] select three different subsets.



The season and winning team for the game with the highest winning score from the different subsets - from index [100:200]:

For subset_season:
Season of the game with the highest winning score: 1985
Winning team of the game with the highest winning score: 1260

For subset_wteam:
Season of the game with the highest winning score: 1991
Winning team of the game with the highest winning score: 1102

For subset_wscore:
Season of the game with the highest winning score: 2014
Winning team of the game with the highest winning score: 1155


# Filtering Rows Conditionally

Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is that you access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return only those specific rows in a dataframe format (df[df['Wscore'] > 150]).

In [34]:
df[df['Wscore'] > 150]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
5269,1986,75,1258,151,1109,107,H,0
12046,1988,40,1328,152,1147,84,H,0
12355,1988,52,1328,151,1173,99,N,0
16040,1989,40,1328,152,1331,122,H,0
16853,1989,68,1258,162,1109,144,A,0
17867,1989,92,1258,181,1109,150,H,0
19653,1990,30,1328,173,1109,101,H,0
19971,1990,38,1258,152,1109,137,A,0
20022,1990,40,1116,166,1109,101,H,0
22145,1990,97,1258,157,1362,115,H,0


This also works if you have multiple conditions. Let's say we want to find out when the winning team scores more than 150 points and when the losing team scores below 100. 

In [35]:
df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
12046,1988,40,1328,152,1147,84,H,0
12355,1988,52,1328,151,1173,99,N,0
25656,1991,84,1106,151,1212,97,H,0
28687,1992,54,1261,159,1319,86,H,0
35023,1993,112,1380,155,1341,91,A,0
52600,1998,33,1395,153,1410,87,H,0


**Q**: Create a new column in the DataFrame called 'ScoreDifference' which is the absolute difference between the winning score and the losing score. Filter the DataFrame to only include games where the 'ScoreDifference' is greater than the average 'ScoreDifference' for all games.

In [36]:
# Write your code here

# I am creating a new column for 'ScoreDifference' called ScoreDiff
df['ScoreDiff'] = abs(df['Wscore'] - df['Lscore'])

# The average 'ScoreDifference' for all games
avg_ScoreDiff = df['ScoreDiff'].mean()

# I am filtering the DataFrame based on the asked condition 
New_df = df[df['ScoreDiff'] > avg_ScoreDiff]


New_df.head(2)

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDiff
0,1985,20,1228,81,1328,64,N,0,17
3,1985,25,1165,70,1432,54,H,0,16


**Q**: From this filtered DataFrame, find the season and teams involved in the game with the highest 'ScoreDifference'.

In [37]:
# Write your code here

#iloc[['scoreDif'].argmax()]

#New_df.iloc[New_df['ScoreDiff'].argmax()]

max_ScoreDiff_index = New_df['ScoreDiff'].idxmax()
season, winningteam, losingteam = New_df.loc[max_ScoreDiff_index, ['Season', 'Wteam','Lteam']]


print("Firstly, the highest scoredifference was:", New_df['ScoreDiff'].max())
print()
print('The season and teams involved in the game with the highest scoredifference:')

print("Season:", season)
print("Winning team:", winningteam)
print("Losing team:", losingteam)

Firstly, the highest scoredifference was: 91

The season and teams involved in the game with the highest scoredifference:
Season: 1996
Winning team: 1409
Losing team: 1341


# Grouping

Another important function in Pandas is **groupby()**. This is a function that allows you to group entries by certain attributes (e.g. grouping entries by Wteam number) and then perform operations on them. The following command groups all the entries (games) with the same Wteam number and finds the mean winning score for each group.

In [38]:
df.groupby('Wteam')['Wscore'].mean().head()

Wteam
1101    78.111111
1102    69.893204
1103    75.839768
1104    75.825944
1105    74.960894
Name: Wscore, dtype: float64

This next command groups all the games with the same Wteam number and counts how many times that specific team won at home, on the road, or at a neutral site.

In [39]:
df.groupby('Wteam')['Wloc'].value_counts().head(9)

Wteam  Wloc
1101   H        12
       A         3
       N         3
1102   H       204
       A        73
       N        32
1103   H       324
       A       153
       N        41
Name: Wloc, dtype: int64
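
Two follow-ups are often handy here: computing several statistics per group at once with **agg()**, and turning the stacked location counts into a table with one column per location. This is a sketch on hypothetical data; `unstack(fill_value=0)` is one way to do the reshaping, chosen here for illustration.

```python
import pandas as pd

# Hypothetical wins for two teams.
games = pd.DataFrame(
    {
        "Wteam": [1101, 1101, 1102, 1102, 1102],
        "Wscore": [70, 80, 60, 90, 75],
        "Wloc": ["H", "A", "H", "H", "N"],
    }
)

# Several summary statistics per team in one call.
summary = games.groupby("Wteam")["Wscore"].agg(["mean", "max", "count"])

# One row per team, one column per location, 0 where a team has no wins there.
loc_table = games.groupby("Wteam")["Wloc"].value_counts().unstack(fill_value=0)

print(summary)
print(loc_table)
```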

Each dataframe has a **values** attribute, which is useful because it returns the contents of your dataframe as a NumPy array.

In [40]:
df.values

array([[1985, 20, 1228, ..., 'N', 0, 17],
       [1985, 25, 1106, ..., 'H', 0, 7],
       [1985, 25, 1112, ..., 'H', 0, 7],
       ...,
       [2016, 132, 1246, ..., 'N', 1, 5],
       [2016, 132, 1277, ..., 'N', 0, 4],
       [2016, 132, 1386, ..., 'N', 0, 13]], dtype=object)

Now you can access elements just like you would in an array.

In [41]:
df.values[0][0]

1985
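
The `dtype=object` in the output above appears because the columns mix numbers and strings. The sketch below uses **to_numpy()**, the method form of the same conversion, on a hypothetical two-row dataframe to show that selecting only numeric columns first gives a numeric array.

```python
import pandas as pd

# Hypothetical dataframe with two numeric columns and one string column.
games = pd.DataFrame({"Season": [1985, 1985], "Wscore": [81, 77], "Wloc": ["N", "H"]})

mixed = games.to_numpy()                           # mixed column types give an object array
numeric = games[["Season", "Wscore"]].to_numpy()   # numeric columns only give an integer array

print(mixed.dtype, numeric.dtype)
print(numeric[0, 1])   # row 0, column 1
```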

**Q**: Group the DataFrame by season and find the average winning score for each season.

In [42]:
# Write your code here
print('The seasons and the average winning score for each season are as following: (I am however only showing the first 5)')
df.groupby('Season')['Wscore'].mean().head().reset_index()

The seasons and the average winning score for each season are as following: (I am however only showing the first 5)


Unnamed: 0,Season,Wscore
0,1985,74.72304
1,1986,74.81364
2,1987,77.99387
3,1988,79.773704
4,1989,81.728511


**Q**: Group the DataFrame by winning team and find the maximum winning score for each team across all seasons.

In [43]:
# Write your code here
df.groupby('Wteam')[['Wscore', 'Season']].max().reset_index()

Unnamed: 0,Wteam,Wscore,Season
0,1101,95,2016
1,1102,111,2016
2,1103,109,2016
3,1104,114,2016
4,1105,114,2016
...,...,...,...
359,1460,136,2016
360,1461,112,2016
361,1462,125,2016
362,1463,105,2016


**Q**: Group the DataFrame by both season and winning team. Find the team with the highest average winning score for each season.

In [44]:
# Write your code here
df.groupby(['Season', 'Wteam'])['Wscore'].mean().reset_index().sort_values('Wscore', ascending=False).groupby('Season').first()



Unnamed: 0_level_0,Wteam,Wscore
Season,Unnamed: 1_level_1,Unnamed: 2_level_1
1985,1328,92.8
1986,1109,91.2
1987,1380,95.875
1988,1258,111.75
1989,1258,117.315789
1990,1258,126.347826
1991,1380,112.3125
1992,1380,99.642857
1993,1380,101.875
1994,1380,106.583333


**Q**: Create a new DataFrame that counts the number of wins for each team in each season. This will involve grouping by both season and winning team, and then using the count() function.

In [45]:
# Write your code here
New_df2=df.groupby(['Season', 'Wteam'])['Wteam'].count()
print('The number of wins for each team for each season looks as following:')

print(New_df2)

#I am counting how many times each Wteam occurs in the Wteam column for every season in the dataset,
#because that count is the number of wins.

The number of wins for each team for each season looks as following:
Season  Wteam
1985    1102      5
        1103      9
        1104     21
        1106     10
        1108     19
                 ..
2016    1460     20
        1461     12
        1462     27
        1463     21
        1464      9
Name: Wteam, Length: 10172, dtype: int64


**Q**: For each season, find the team with the most wins. This will involve creating a DataFrame similar to the one in task 5, and then using the idxmax() function for each season.

In [47]:
# Write your code here

# Group the DataFrame by season and count the occurrences of each team
team_counts = df.groupby('Season')['Wteam'].value_counts()

# Find the team with the most wins for each season
teams_with_most_wins = team_counts.groupby('Season').idxmax().reset_index()

print(teams_with_most_wins)

    Season         Wteam
0     1985  (1985, 1385)
1     1986  (1986, 1181)
2     1987  (1987, 1424)
3     1988  (1988, 1112)
4     1989  (1989, 1328)
5     1990  (1990, 1247)
6     1991  (1991, 1116)
7     1992  (1992, 1181)
8     1993  (1993, 1231)
9     1994  (1994, 1163)
10    1995  (1995, 1116)
11    1996  (1996, 1269)
12    1997  (1997, 1242)
13    1998  (1998, 1242)
14    1999  (1999, 1181)
15    2000  (2000, 1409)
16    2001  (2001, 1181)
17    2002  (2002, 1153)
18    2003  (2003, 1166)
19    2004  (2004, 1390)
20    2005  (2005, 1228)
21    2006  (2006, 1181)
22    2007  (2007, 1242)
23    2008  (2008, 1272)
24    2009  (2009, 1272)
25    2010  (2010, 1242)
26    2011  (2011, 1242)
27    2012  (2012, 1246)
28    2013  (2013, 1211)
29    2014  (2014, 1455)
30    2015  (2015, 1246)
31    2016  (2016, 1242)
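
The `Wteam` column above holds `(Season, Wteam)` tuples because **idxmax()** returns the full label of a two-level index. One possible way to keep only the team ID is sketched below on hypothetical data; the names `wins`, `best` and `top_teams` are illustrative.

```python
import pandas as pd

# Hypothetical wins across two seasons.
games = pd.DataFrame(
    {
        "Season": [1985, 1985, 1985, 1986, 1986, 1986],
        "Wteam": [1385, 1385, 1181, 1181, 1181, 1424],
    }
)

# Number of wins per (Season, Wteam) pair.
wins = games.groupby(["Season", "Wteam"]).size()

# For each season, the (Season, Wteam) label with the most wins.
best = wins.groupby(level="Season").idxmax()

# Unpack the tuples so the result has plain Season and Wteam columns.
top_teams = pd.DataFrame({"Season": best.index, "Wteam": [team for _, team in best]})

print(top_teams)
```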


**Q**: Group the DataFrame by losing team and find the average losing score for each team across all seasons. Compare this with the average winning score for each team from task 3. Are there teams that have a higher average losing score than winning score?

In [48]:
# Write your code here
df.groupby(['Season', 'Lteam'])['Lscore'].mean().reset_index().sort_values('Lscore', ascending=False).groupby('Season').first()


Unnamed: 0_level_0,Lteam,Lscore
Season,Unnamed: 1_level_1,Unnamed: 2_level_1
1985,1260,81.0
1986,1109,90.157895
1987,1424,88.0
1988,1379,89.6
1989,1258,99.2
1990,1258,117.8
1991,1258,95.866667
1992,1258,86.615385
1993,1246,85.666667
1994,1407,91.785714
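
The output above ranks losing teams within each season. For the per-team comparison that the question asks about, one possible approach is sketched below on hypothetical data: compute each team's average winning score and average losing score separately, line them up by team ID, and filter.

```python
import pandas as pd

# Hypothetical games between three teams.
games = pd.DataFrame(
    {
        "Wteam": [1, 1, 2, 3],
        "Wscore": [60, 70, 80, 90],
        "Lteam": [2, 3, 1, 1],
        "Lscore": [50, 65, 75, 85],
    }
)

# Average score in games the team won, and in games the team lost.
avg_win = games.groupby("Wteam")["Wscore"].mean().rename("AvgWscore")
avg_loss = games.groupby("Lteam")["Lscore"].mean().rename("AvgLscore")

# concat aligns the two Series on team ID; a team missing from one side gets NaN there.
comparison = pd.concat([avg_win, avg_loss], axis=1)

# Teams that score more on average when they lose than when they win.
higher_when_losing = comparison[comparison["AvgLscore"] > comparison["AvgWscore"]]

print(higher_when_losing)
```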


# Dataframe Iteration

In order to iterate through dataframes, we can use the **iterrows()** function. Below is an example of what the first two rows look like. Each row in iterrows is a Series object

In [49]:
for index, row in df.iterrows():
    print(row)
    if index == 1:
        break

Season       1985
Daynum         20
Wteam        1228
Wscore         81
Lteam        1328
Lscore         64
Wloc            N
Numot           0
ScoreDiff      17
Name: 0, dtype: object
Season       1985
Daynum         25
Wteam        1106
Wscore         77
Lteam        1354
Lscore         70
Wloc            H
Numot           0
ScoreDiff       7
Name: 1, dtype: object


**Q**: Create a new column 'HighScoringGame' that is 'Yes' if the winning score is greater than 100 and 'No' otherwise. This will require iterating over the rows of the DataFrame and checking the value of the winning score for each row.

In [50]:
# Write your code here

#I am creating an empty list, where the values can be stored
HighScoringGame = []

for index, row in df.iterrows():
    if row['Wscore'] > 100:
        HighScoringGame.append('Yes')
    else:
        HighScoringGame.append('No')
    

#This creates a new column called 'HighScoringGame' in df and assigns it the values from the HighScoringGame list.
df['HighScoringGame'] = HighScoringGame


print(df.head(2))


   Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot  ScoreDiff  \
0    1985      20   1228      81   1328      64    N      0         17   
1    1985      25   1106      77   1354      70    H      0          7   

  HighScoringGame  
0              No  
1              No  


**Q**: Calculate the total number of games played by each team, whether they won or lost. This will require iterating over the rows of the DataFrame and updating a dictionary that keeps track of the number of games for each team.

In [61]:
# Write your code here
# Initialize an empty dictionary to store the number of games for each team
team_games = {}

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the winning and losing teams
    winning_team = row['Wteam']
    losing_team = row['Lteam']

    # Update the count for the winning team
    if winning_team in team_games:
        team_games[winning_team] += 1
    else:
        team_games[winning_team] = 1

    # Update the count for the losing team
    if losing_team in team_games:
        team_games[losing_team] += 1
    else:
        team_games[losing_team] = 1


print("The total number of games played by each team are as follows:")
print("(Team, number of games):")
for team, games in team_games.items():
    print(team, games)

The total number of games played by each team are as follows:
(Team, number of games):
1228 992
1328 968
1106 855
1354 906
1112 981
1223 363
1165 833
1432 69
1192 908
1447 903
1218 931
1337 922
1226 847
1242 993
1268 969
1260 914
1133 949
1305 922
1424 974
1307 969
1288 925
1344 951
1438 952
1374 916
1411 903
1412 962
1397 963
1417 966
1225 880
1116 980
1368 808
1120 936
1391 879
1135 847
1306 898
1143 947
1388 897
1153 970
1184 863
1159 887
1171 829
1216 930
1173 960
1134 200
1177 942
1296 879
1193 942
1265 934
1196 981
1416 881
1206 938
1137 912
1210 972
1149 824
1211 921
1102 840
1234 968
1114 910
1332 927
1243 927
1317 883
1257 994
1231 969
1277 966
1145 934
1278 948
1453 912
1286 851
1186 849
1301 979
1144 850
1325 942
1384 887
1326 968
1248 896
1287 857
1339 879
1334 899
1365 907
1375 896
1126 906
1403 939
1152 865
1423 931
1347 858
1429 930
1428 931
1437 983
1436 911
1172 879
1439 946
1330 963
1443 942
1121 175
1455 947
1249 873
1241 942
1440 825
1314 1010
1200 922
1323 950
1264

**Q**: For each season, find the game with the highest score difference (winning score - losing score). This will require iterating over the rows of the DataFrame, keeping track of the highest score difference for each season, and updating it if a game with a higher score difference is found.

In [52]:
max_score_diff_games = df.groupby('Season')['ScoreDiff'].idxmax()

# The game with the maximum score difference for each season
max_score_diff = df.loc[max_score_diff_games]

# Print the game with the highest score difference for each season
for index, row in max_score_diff.iterrows():
    season = row['Season']
    index = row.name  # I assume that the index represents the game number because every row in the dataframe describes a game.
    winning_team = row['Wteam']
    losing_team = row['Lteam']
    score_diff = row['ScoreDiff']
    print(f"Season: {season}, Game Number: {index}, Winning Team: {winning_team}, Losing Team: {losing_team}, Score Difference: {score_diff}")

Season: 1985, Game Number: 236, Winning Team: 1361, Losing Team: 1288, Score Difference: 60
Season: 1986, Game Number: 4731, Winning Team: 1314, Losing Team: 1264, Score Difference: 84
Season: 1987, Game Number: 8240, Winning Team: 1155, Losing Team: 1118, Score Difference: 73
Season: 1988, Game Number: 12046, Winning Team: 1328, Losing Team: 1147, Score Difference: 68
Season: 1989, Game Number: 16677, Winning Team: 1242, Losing Team: 1135, Score Difference: 70
Season: 1990, Game Number: 19502, Winning Team: 1181, Losing Team: 1217, Score Difference: 76
Season: 1991, Game Number: 25161, Winning Team: 1163, Losing Team: 1148, Score Difference: 68
Season: 1992, Game Number: 27997, Winning Team: 1116, Losing Team: 1126, Score Difference: 82
Season: 1993, Game Number: 33858, Winning Team: 1328, Losing Team: 1197, Score Difference: 81
Season: 1994, Game Number: 36404, Winning Team: 1228, Losing Team: 1152, Score Difference: 69
Season: 1995, Game Number: 39858, Winning Team: 1246, Losing Tea

Remember, iterating over a DataFrame should generally be avoided if a vectorized operation can be used instead, as vectorized operations are usually much faster. However, these tasks are designed to give practice with DataFrame iteration for cases where it might be necessary.

Vectorized Operation Example: Create a new column 'HighScoringGame' in the DataFrame using a vectorized operation. This column should contain 'Yes' if the winning score is greater than 100 and 'No' otherwise. Use the np.where function from the numpy library for this task.

In [53]:
import numpy as np
df['HighScoringGame'] = np.where(df['Wscore'] > 100, 'Yes', 'No')

**Q**: Vectorized Operation: Calculate the total number of games played by each team, whether they won or lost. Instead of iterating over the DataFrame, use the value_counts() function on the winning team and losing team columns separately, and then add the two Series together.

In [60]:
# Write your code here
# Wins per team: the number of times each team occurs in the Wteam column
wins = df['Wteam'].value_counts()

# Losses per team: the number of times each team occurs in the Lteam column
losses = df['Lteam'].value_counts()

# I am adding the wins and losses together to get the total number of games
total_games = wins.add(losses)


print("The total number of games played by each team are as following:" )
print("(Team, number of games):")
print(total_games.head(12))

The total number of games played by each team are as following:
(Team, number of games):
1101     76
1102    840
1103    910
1104    975
1105    447
1106    855
1107    512
1108    866
1109    169
1110    910
1111    876
1112    981
dtype: int64
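
Adding two Series aligns them by label, so a team that appears in only one of the two columns would end up with NaN. In the output above every team has both wins and losses, but as a defensive sketch (on hypothetical data) the `fill_value` argument of **add()** treats a missing count as 0.

```python
import pandas as pd

# Hypothetical games: team 1101 never loses and team 1103 never wins.
games = pd.DataFrame({"Wteam": [1101, 1101, 1102], "Lteam": [1102, 1103, 1103]})

wins = games["Wteam"].value_counts()
losses = games["Lteam"].value_counts()

naive_total = wins.add(losses)                             # NaN where a team is missing on one side
total_games = wins.add(losses, fill_value=0).astype(int)   # missing counts treated as 0

print(naive_total)
print(total_games)
```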


**Q**: For each season, find the game with the highest score difference (winning score - losing score). Instead of iterating over the DataFrame, create a new column 'ScoreDifference' using vectorized subtraction, then use the groupby() function and idxmax() function to find the game with the highest score difference for each season.

In [57]:
# Write your code here

#We already calculated the score difference earlier (the ScoreDiff column), so I reuse it here. The commented line below shows how it would be created.
#df['ScoreDifference'] = df['Wscore'] - df['Lscore'] 

games_with_highest_score_diff = df.loc[df.groupby('Season')['ScoreDiff'].idxmax()]

print(games_with_highest_score_diff.head(3))

      Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot  ScoreDiff  \
236     1985      33   1361     128   1288      68    H      0         60   
4731    1986      60   1314     129   1264      45    N      0         84   
8240    1987      51   1155     112   1118      39    H      0         73   

     HighScoringGame  
236              Yes  
4731             Yes  
8240             Yes  


# Extracting Rows and Columns

The bracket indexing operator is one way to extract certain columns from a dataframe.

In [None]:
df[['Wscore', 'Lscore']].head()

Unnamed: 0,Wscore,Lscore
0,81,64
1,77,70
2,63,56
3,70,54
4,86,74


Notice that you can achieve the same result by using loc. It is a very versatile indexer that can help you in a lot of accessing and extracting tasks.

In [None]:
df.loc[:, ['Wscore', 'Lscore']].head()

Unnamed: 0,Wscore,Lscore
0,81,64
1,77,70
2,63,56
3,70,54
4,86,74


Note the difference in the return types when you use single brackets and when you use double brackets.

In [None]:
type(df['Wscore'])

pandas.core.series.Series

In [None]:
type(df[['Wscore']])

pandas.core.frame.DataFrame

You've seen before that you can access columns through df['col name']. You can access rows by using slicing operations. 

In [None]:
df[0:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDiff,HighScoringGame
0,1985,20,1228,81,1328,64,N,0,17,No
1,1985,25,1106,77,1354,70,H,0,7,No
2,1985,25,1112,63,1223,56,H,0,7,No


Here's an equivalent using iloc

In [None]:
df.iloc[0:3,:]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDiff,HighScoringGame
0,1985,20,1228,81,1328,64,N,0,17,No
1,1985,25,1106,77,1354,70,H,0,7,No
2,1985,25,1112,63,1223,56,H,0,7,No


# Data Cleaning

A big part of doing well in Kaggle competitions is data cleaning. A lot of the time, the CSV file you're given (the Titanic dataset is a good example) will have a lot of missing values, which you have to identify. The following **isnull** function flags every missing value in the dataframe, and chaining **sum()** then totals them for each column. In this case, we have a pretty clean dataset.

In [None]:
df.isnull().sum()

Season             0
Daynum             0
Wteam              0
Wscore             0
Lteam              0
Lscore             0
Wloc               0
Numot              0
ScoreDiff          0
HighScoringGame    0
dtype: int64

If you do end up having missing values in your datasets, be sure to get familiar with these two functions. 
* **dropna()** - This function allows you to drop all (or some) of the rows that have missing values.
* **fillna()** - This function allows you to replace missing values with the value that you pass in.
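
The NCAA dataset has no missing values, so the sketch below builds a small hypothetical dataframe with a few gaps to show both functions. Filling the score with the column mean and the location with `'unknown'` is only an illustrative choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing score and one missing location.
games = pd.DataFrame({"Wscore": [81.0, np.nan, 63.0], "Wloc": ["N", "H", None]})

missing_per_column = games.isnull().sum()

complete_rows = games.dropna()                    # drop rows with any missing value
score_known = games.dropna(subset=["Wscore"])     # drop rows only if Wscore is missing

# Fill each column with its own replacement value.
filled = games.fillna({"Wscore": games["Wscore"].mean(), "Wloc": "unknown"})

print(missing_per_column)
print(filled)
```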

# Other Useful Functions

* **drop()** - This function removes the column or row that you pass in (you also have to specify the axis).
* **agg()** - The aggregate function lets you compute summary statistics about each group
* **apply()** - Lets you apply a specific function to any/all elements in a Dataframe or Series
* **get_dummies()** - Helpful for turning categorical data into one hot vectors.
* **drop_duplicates()** - Lets you remove identical rows
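
A few of the functions listed above can be sketched together on a small hypothetical dataframe. The row-wise **apply()** is shown only to illustrate the call; for simple arithmetic like this, subtracting the two columns directly is the faster option.

```python
import pandas as pd

# Hypothetical games; the third row is an exact copy of the first.
games = pd.DataFrame(
    {"Wscore": [81, 77, 81], "Lscore": [64, 70, 64], "Wloc": ["N", "H", "N"]}
)

unique_games = games.drop_duplicates()             # removes the repeated row
without_loc = unique_games.drop("Wloc", axis=1)    # axis=1 means "drop a column"

# Apply a function to each row (axis=1) to compute the winning margin.
margin = unique_games.apply(lambda row: row["Wscore"] - row["Lscore"], axis=1)

# One indicator column per distinct location value.
one_hot = pd.get_dummies(unique_games["Wloc"])

print(without_loc.columns.tolist())
print(margin.tolist())
print(one_hot)
```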

# Lots of Other Great Resources

Pandas has been around for a while and there are a lot of other good resources if you're still interested in getting the most out of this library.
* http://pandas.pydata.org/pandas-docs/stable/10min.html
* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
* https://www.dataquest.io/blog/pandas-python-tutorial/
* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view
* https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y