# File Reading Example - College Basketball Dataset
Datafile: "College Basketball Dataset.csv"<br>
Datafile description: "College basketball-Description.pdf"

Updated by Jingwei Liu (2021-2-15)


### Let's first define some functions to display the data we read and to report some basic information about the dataset.
In this example, the dataset will be read as <font color = "red">**a list of lists**</font>, so we will define a small function that prints the values in a list of lists.

In [23]:
# Define a "ShowData" function 
#  dataset is a list of lists
def ShowData(dataset = [["No dataset sent"]]):
    for r in dataset:
        # print elements in a tab-separated format
        print ("\t".join(r))

# sample calls
ShowData([["one", "two", "three"], ["four", "five", "six"], ["seven", "eight", "nine"]])

one	two	three
four	five	six
seven	eight	nine


In the above cell, we define a function named <font color = "blue">**ShowData**</font>. It has only one input parameter, called **dataset**. The **dataset** parameter has a default value, which means the function still works even if you don't pass a value for it. You can try it yourself.
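
To try the default value, here is a minimal sketch. It repeats the `ShowData` definition from the cell above so that it runs on its own, then calls the function once without an argument and once with a small made-up list of lists.

```python
# Same definition as in the cell above, repeated so this sketch is self-contained
def ShowData(dataset = [["No dataset sent"]]):
    for r in dataset:
        # print elements in a tab-separated format
        print("\t".join(r))

# No argument: the default value [["No dataset sent"]] is used
ShowData()

# Explicit argument (made-up sample values): the default is ignored
ShowData([["a", "b"], ["c", "d"]])
```

The first call prints the single default row; the second prints the two rows that were passed in.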

### Also, it is always good to check the shape of the dataset you read
*The function below will show the number of rows and columns in the list of lists.*
<br>
**Here, the row count is the number of elements in the outer list, and the column count is the number of elements in each inner list.**

In [24]:
# Define a "ShowRowsAndCols" function which shows the number of rows and columns in the dataset
# dataset is a list of lists. The row count is the number of elements in the outer list.
# The column count is the number of elements in the first inner list (index 0), which
# assumes that every row has the same length.
def ShowRowsAndCols(dataset = [["No dataset sent"]]):
    print("There are {} rows in the dataset".format(len(dataset)))
    print("There are {} columns in the dataset".format(len(dataset[0])))
    
# sample calls
ShowRowsAndCols([["one", "two", "three"], ["four", "five", "six"]])

There are 2 rows in the dataset
There are 3 columns in the dataset


### Now, let's read the dataset into a list of lists using different coding styles

One thing you should know is that each element in the list of lists is stored <font color = "red">**as a string**</font> (even if it is a number).

In [25]:
# Initial version - "standard programming"
#
# Define a list for the data.  Will be a list of lists.
data1 = []
# open the file
fname = "../Data/College Basketball Dataset.csv"
f = open(fname, "r")
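# you can select which part of the dataset you want to read
# Ex. we can ignore the first 5 lines (the header row and the first 4 records):
# read 6 lines so that "line" ends up holding the 6th line, the first one we keep
for i in range(6):
    line = f.readline()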
# loop until we run out of lines
while (line):
    # strip the newline and tokenize (split on commas, in this case)
    tokens = line.rstrip().split(',')
    # append this record to the dataset
    data1.append(tokens)
    # read the next line
    line = f.readline()
# close the file
f.close()
# show the data
ShowData(data1)

Gonzaga	WCC	39	37	117.8	86.3	0.9728	56.6	41.1	16.2	17.1	30	26.2	39	26.9	56.3	40	38.2	29	71.5	7.7	2ND	1	2017
Duke	ACC	39	35	125.2	90.6	0.9764	56.6	46.5	16.3	18.6	35.8	30.2	39.8	23.9	55.9	46.3	38.7	31.4	66.4	10.7	Champions	1	2015
Virginia	ACC	38	35	123	89.9	0.9736	55.2	44.7	14.7	17.5	30.4	25.4	29.1	26.3	52.5	45.7	39.5	28.9	60.7	11.1	Champions	1	2019
North Carolina	ACC	39	33	121	91.5	0.9615	51.7	48.1	16.2	18.6	41.3	25	34.3	31.6	51	46.3	35.5	33.9	72.8	8.4	Champions	1	2017
Villanova	BE	40	35	123.1	90.9	0.9703	56.1	46.7	16.3	20.6	28.2	29.4	34.1	30	57.4	44.1	36.2	33.9	66.7	8.9	Champions	2	2016
Villanova	BE	40	36	128.4	94.1	0.9725	59.5	48.5	15	18.2	29.6	27.1	29.4	26.7	59	49	40.1	31.7	69.6	10.6	Champions	1	2018
Louisville	ACC	36	27	109.4	87.4	0.929	47.7	44	17.2	21.3	34.7	30.8	38.7	33.3	48.4	43.3	30.7	30.3	65.6	5.8	E8	4	2015
Notre Dame	ACC	38	32	125.3	98.6	0.9401	58.3	47.9	14.5	17.3	27.9	32.2	36.7	24.1	58.2	47.4	39	32.6	63.9	8.6	E8	3	2015
Notre Dame	ACC	36	24	118.3	103.3	0.8269	54	49.5	15.3	14.8

After running the above cell, you should see that the data is read as a list of lists. Each row we keep is a list, and each of those lists is an element of one bigger list. **That is why we call this a list of lists.**

Now, let's check the value and data type of the <font color = "red">**third**</font> element of the <font color = "red">**first**</font> row *(keep in mind that indexing in Python starts from 0)*

In [26]:
data1[0][2]

'39'

In [27]:
type(data1[0][2])

str
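
Because every value is a string, arithmetic needs an explicit conversion first. The sketch below uses the first five fields of the Gonzaga row shown in the output above; the variable names are only illustrative.

```python
# First five fields of the first row printed above (all stored as strings)
row = ["Gonzaga", "WCC", "39", "37", "117.8"]

# "+" on two strings concatenates them instead of adding numbers
as_text = row[2] + row[3]

# Convert before doing arithmetic: int() for whole numbers, float() for decimals
games = int(row[2])
wins = int(row[3])
adjoe = float(row[4])

print(as_text)        # 3937
print(games + wins)   # 76
print(adjoe)          # 117.8
```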

#### A Python-esque version of the code.
You can see in this cell, it uses fewer lines to do the same work.  For your assignment, you are free to use any of the code versions as a starting point.

In [28]:
#
# Python-esque version 1
#
# Grab all the lines from the file except the header row (index 0),
# i.e. start at index 1, then strip the newline and tokenize
with open("../Data/College Basketball Dataset.csv") as f:
    data2 = [line.rstrip().split(',') for line in f.readlines()[1:]]
# show the data
ShowData(data2)


North Carolina	ACC	40	33	123.3	94.9	0.9531	52.6	48.1	15.4	18.2	40.7	30	32.3	30.4	53.9	44.6	32.7	36.2	71.7	8.6	2ND	1	2016
Wisconsin	B10	40	36	129.1	93.6	0.9758	54.8	47.7	12.4	15.8	32.1	23.7	36.2	22.4	54.8	44.7	36.5	37.5	59.3	11.3	2ND	1	2015
Michigan	B10	40	33	114.4	90.4	0.9375	53.9	47.7	14	19.5	25.5	24.9	30.7	30	54.7	46.8	35.2	33.2	65.9	6.9	2ND	3	2018
Texas Tech	B12	38	31	115.2	85.2	0.9696	53.5	43	17.7	22.8	27.4	28.7	32.9	36.6	52.8	41.9	36.5	29.7	67.5	7	2ND	3	2019
Gonzaga	WCC	39	37	117.8	86.3	0.9728	56.6	41.1	16.2	17.1	30	26.2	39	26.9	56.3	40	38.2	29	71.5	7.7	2ND	1	2017
Duke	ACC	39	35	125.2	90.6	0.9764	56.6	46.5	16.3	18.6	35.8	30.2	39.8	23.9	55.9	46.3	38.7	31.4	66.4	10.7	Champions	1	2015
Virginia	ACC	38	35	123	89.9	0.9736	55.2	44.7	14.7	17.5	30.4	25.4	29.1	26.3	52.5	45.7	39.5	28.9	60.7	11.1	Champions	1	2019
North Carolina	ACC	39	33	121	91.5	0.9615	51.7	48.1	16.2	18.6	41.3	25	34.3	31.6	51	46.3	35.5	33.9	72.8	8.4	Champions	1	2017
Villanova	BE	40	35	123.1	90.9	0.9703	56.1	46.7	16.3	20.6	28

#### Another Python-esque version of the code
This time we use the `csv` module to help us read the dataset, and we read all rows. Note that this version retains the column header row.

In [29]:
#
# Python-esque version 2 
#
# use the csv module
import csv
data3 = []
with open("../Data/College Basketball Dataset.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        data3.append(row)
# show the data
ShowData(data3)

TEAM	CONF	G	W	ADJOE	ADJDE	BARTHAG	EFG_O	EFG_D	TOR	TORD	ORB	DRB	FTR	FTRD	2P_O	2P_D	3P_O	3P_D	ADJ_T	WAB	POSTSEASON	SEED	YEAR
North Carolina	ACC	40	33	123.3	94.9	0.9531	52.6	48.1	15.4	18.2	40.7	30	32.3	30.4	53.9	44.6	32.7	36.2	71.7	8.6	2ND	1	2016
Wisconsin	B10	40	36	129.1	93.6	0.9758	54.8	47.7	12.4	15.8	32.1	23.7	36.2	22.4	54.8	44.7	36.5	37.5	59.3	11.3	2ND	1	2015
Michigan	B10	40	33	114.4	90.4	0.9375	53.9	47.7	14	19.5	25.5	24.9	30.7	30	54.7	46.8	35.2	33.2	65.9	6.9	2ND	3	2018
Texas Tech	B12	38	31	115.2	85.2	0.9696	53.5	43	17.7	22.8	27.4	28.7	32.9	36.6	52.8	41.9	36.5	29.7	67.5	7	2ND	3	2019
Gonzaga	WCC	39	37	117.8	86.3	0.9728	56.6	41.1	16.2	17.1	30	26.2	39	26.9	56.3	40	38.2	29	71.5	7.7	2ND	1	2017
Duke	ACC	39	35	125.2	90.6	0.9764	56.6	46.5	16.3	18.6	35.8	30.2	39.8	23.9	55.9	46.3	38.7	31.4	66.4	10.7	Champions	1	2015
Virginia	ACC	38	35	123	89.9	0.9736	55.2	44.7	14.7	17.5	30.4	25.4	29.1	26.3	52.5	45.7	39.5	28.9	60.7	11.1	Champions	1	2019
North Carolina	ACC	39	33	121	91.5	0.9615	51.7	48.1	16.2	18.
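
One reason to prefer the `csv` module over `split(',')` is that it understands quoted fields. The sketch below compares the two on an invented two-line sample (not taken from the basketball file) in which one value contains a comma; `io.StringIO` stands in for an open file.

```python
import csv
import io

# Invented sample text: the first field of the data row contains a comma
sample = 'NAME,CONF\n"Springfield, East",XYZ\n'

# Plain split breaks the quoted field into two pieces and keeps the quotes
naive = [line.rstrip().split(',') for line in io.StringIO(sample)]

# csv.reader keeps the quoted field together and removes the quotes
parsed = list(csv.reader(io.StringIO(sample)))

print(naive[1])    # three pieces: the quoted name is split in two
print(parsed[1])   # two fields, as intended
```

When reading a real file with the `csv` module, the Python documentation recommends opening it with `newline=''`.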

### After reading the file, check the number of rows and columns in the list of lists (all three versions)
You can see that the three datasets have different numbers of rows, because we drop a different number of lines each time we read the file. For **data1**, we drop 5 lines; for **data2**, we drop 1 line; for **data3**, we drop no lines. Make sure you understand how we achieve this.

In [30]:
ShowRowsAndCols(data1)
ShowRowsAndCols(data2)
ShowRowsAndCols(data3)

There are 1753 rows in the dataset
There are 24 columns in the dataset
There are 1757 rows in the dataset
There are 24 columns in the dataset
There are 1758 rows in the dataset
There are 24 columns in the dataset
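
To see where the different row counts come from, the sketch below applies the same three reading approaches to a small made-up file held in memory (one header line plus seven data lines). `io.StringIO` stands in for the real file, and the names `text`, `rows1`, `rows2`, and `rows3` are illustrative only.

```python
import csv
import io

# Made-up file content: 1 header line + 7 data lines = 8 lines in total
text = "h1,h2\n" + "".join("r{},{}\n".format(i, i) for i in range(1, 8))

# Version 1: read 6 lines; the first 5 are discarded, the 6th is the first one kept
f = io.StringIO(text)
for _ in range(6):
    line = f.readline()
rows1 = []
while line:
    rows1.append(line.rstrip().split(','))
    line = f.readline()

# Version 2: slice off only the header line
rows2 = [line.rstrip().split(',') for line in io.StringIO(text).readlines()[1:]]

# Version 3: csv.reader keeps every line, including the header
rows3 = list(csv.reader(io.StringIO(text)))

print(len(rows1), len(rows2), len(rows3))   # 3 7 8
```

The counts are 8 - 5, 8 - 1, and 8 - 0, the same pattern as 1753, 1757, and 1758 above.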


### We can do some simple calculations with the dataset we read
Here, we calculate the mean number of games played by a team during 2015-2020.

In [31]:
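# use "total" rather than "sum" so the built-in sum() function is not overwritten
total = 0
# iterate from first row to last row
for row in data2:
    # add the games played (column index 2, "G") of every row to the total
    total = total + float(row[2])
mean = total/len(data2)
mean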

31.52305065452476
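
The same calculation can be written more compactly with a list comprehension and `statistics.mean` from the standard library. The sketch below uses three rows copied from the output above (first four fields only) rather than the full file, so its result differs from the value shown; `rows`, `games`, and `avg` are illustrative names.

```python
from statistics import mean

# Three rows from the output above, truncated to the first four fields
rows = [
    ["North Carolina", "ACC", "40", "33"],
    ["Wisconsin", "B10", "40", "36"],
    ["Texas Tech", "B12", "38", "31"],
]

# Extract column index 2 (games played) and convert each string to a float
games = [float(r[2]) for r in rows]

# Average of 40, 40, and 38
avg = mean(games)
print(avg)
```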

### Look at the column headers

In [32]:
# Use the dataset that includes the headers (data3)
data3[0]

['TEAM',
 'CONF',
 'G',
 'W',
 'ADJOE',
 'ADJDE',
 'BARTHAG',
 'EFG_O',
 'EFG_D',
 'TOR',
 'TORD',
 'ORB',
 'DRB',
 'FTR',
 'FTRD',
 '2P_O',
 '2P_D',
 '3P_O',
 '3P_D',
 'ADJ_T',
 'WAB',
 'POSTSEASON',
 'SEED',
 'YEAR']
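
With the header row available, a column can be looked up by name instead of by a hard-coded position. Here is a minimal sketch, using the first four header names and three rows from the output above; `header`, `rows`, `col`, and `record` are illustrative names.

```python
# First four header names and three rows from the output above
header = ["TEAM", "CONF", "G", "W"]
rows = [
    ["North Carolina", "ACC", "40", "33"],
    ["Wisconsin", "B10", "40", "36"],
    ["Michigan", "B10", "40", "33"],
]

# Find the position of the "W" column from the header
col = header.index("W")

# Pull that column out and convert it to integers
wins = [int(r[col]) for r in rows]

# Pair header names with one row's values for easier reading
record = dict(zip(header, rows[0]))

print(col)              # 3
print(wins)             # [33, 36, 33]
print(record["TEAM"])   # North Carolina
```

In the notebook itself, the equivalent inputs would be `data3[0]` for the header and `data3[1:]` for the rows.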