# 1.3.1 File Reading Example - College Basketball Dataset
Datafile: "College Basketball Dataset.csv"<br>
Datafile description: "College basketball-Description.pdf"

Updated by Jingwei Liu (2021-2-15)
<br>Updated by Jeff Smith (2022-08-16)

### Let's first define some functions to show the data we will read and help us get some basic information about the dataset.
In this example, the dataset will be read as <font color = "red">**a list of lists**</font>, so we will define a small function that prints the values in the list of lists.

In [None]:
# Define a "ShowData" function 
#  dataset is a list of lists
def ShowData(dataset = [["No dataset sent"]]):
    for r in dataset:
        # print elements in a tab-separated format
        print("\t".join(r))

# sample calls
ShowData([["one", "two", "three"], ["four", "five", "six"], ["seven", "eight", "nine"]])
#
# Note that this is a simplistic function that shows all of the data.  You would probably want
# to add some arguments that controlled how much of the data to show (e.g., the first 10 "rows").
# We'll get to that later, though.

In the above cell, we define a function named <font color = "blue">**ShowData**</font> with a single parameter, **dataset**. The **dataset** parameter has a default value, so the function still works even if you call it without an argument. You can try it yourself by calling `ShowData()`.
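
As a rough sketch of both ideas (the default value, and the row limit suggested in the cell's closing comment), the variant below adds a `max_rows` argument. The names `show_data` and `max_rows` are hypothetical and are not part of the notebook's own code; the function also returns the number of rows printed so the behavior is easy to check.

```python
def show_data(dataset=None, max_rows=None):
    """Print rows in a tab-separated format and return how many were printed."""
    # Fall back to a placeholder dataset when the caller sends nothing
    if dataset is None:
        dataset = [["No dataset sent"]]
    # Slicing with None as the upper bound keeps every row
    rows = dataset[:max_rows]
    for row in rows:
        print("\t".join(row))
    return len(rows)


sample = [["one", "two", "three"], ["four", "five", "six"], ["seven", "eight", "nine"]]
show_data()                     # no argument: the default value is used
show_data(sample, max_rows=2)   # only the first two rows are printed
```

Using `None` as the default and building the placeholder inside the function is a common way to avoid sharing one mutable default list between calls.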

### Also, it is always good to check the shape of the dataset you read 
*The function below will show the number of rows and columns in the list of lists.*
<br>
**Here, the number of rows is the number of elements in the outer list, and the number of columns is the number of elements in each inner list.**

In [None]:
# Define a "ShowShape" function that shows the number of rows and columns in the dataset.
# dataset is a list of lists. The number of rows is the number of elements in the outer list.
# The number of columns is the number of elements in the first inner list.
def ShowShape(dataset = [["No dataset sent"]]): 
    print("There are {} rows in the dataset".format(len(dataset)))
    print("There are {} columns in the dataset".format(len(dataset[0])))
    
# sample calls
ShowShape([["one", "two", "three"], ["four", "five", "six"]])
#
# This one is also simplistic.  Later, we will see a single "show" function that shows the 
# structure information and, optionally, some or all of the data.

### Now, let's read the data set into a list of lists using three different program patterns.

Note that in the following examples, all elements in the list of lists are stored <font color = "red">**as strings**</font>, even those that look like numbers.  In Homework 2, we will convert some of these elements.
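
To see why the string storage matters, here is a small sketch on a made-up row (the values are invented for illustration and do not come from the data file). The helper `to_number` is a hypothetical name, shown only as one possible way to convert numeric-looking text.

```python
# A made-up row in which every element is a string, as in the examples in this notebook
row = ["Team A", "Conf X", "35", "30"]

# "+" on strings concatenates instead of adding
print(row[2] + row[3])

# Convert explicitly before doing arithmetic
print(float(row[2]) + float(row[3]))


def to_number(text):
    """Return text as a float when possible, otherwise return it unchanged."""
    try:
        return float(text)
    except ValueError:
        return text


converted = [to_number(item) for item in row]
print(converted)
```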

In [None]:
# Initial version - "standard programming"
#
# Define a list for the data.  Will be a list of lists.
data1 = []
# open the file
fname = "../Data/College Basketball Dataset.csv"
f = open(fname, "r")
# You can select which part of the dataset you want to read.
# For example, we can skip the first 5 lines.  Note that this loop reads 6 lines
# (0, 1, 2, ... 5): the first 5 are discarded and the last one read is kept as the first data row.
for i in range(6):
    line = f.readline()
# loop until we run out of lines.  Start the loop with the last row read.
while line:
    # strip the newline and tokenize (split on commas, in this case)
    tokens = line.rstrip().split(',')
    # append this record to the dataset
    data1.append(tokens)
    # read the next line
    line = f.readline()
# close the file
f.close()
# Show the shape
ShowShape(data1)

In [None]:
# show the data
ShowData(data1)

After running the above cell, you should see that the data has been read as a list of lists. Each row of the file becomes a list, and each of those lists is an element of one larger list. **That is why we call this a list of lists.**

Let's look at the first "row" of data (i.e., the first list in the list of lists) *(keep in mind that Python indices start at 0)*:

In [None]:
data1[0]

Now, let's check the value and data type of the <font color = "red">**third**</font> element of the <font color = "red">**first**</font> row (remember, since Python lists are zero-based, both indices start at 0):

In [None]:
data1[0][2], type(data1[0][2])
# Note that the output is an anonymous tuple.

#### A Python-esque version of the code.
This cell does the same kind of work in fewer lines (here skipping only the first line of the file).  For your assignment, you are free to use any of the code versions as a starting point.

In [None]:
#
# Python-esque version 1
#
# Read all the lines from the file, skip line 0 (the first line) with
# the [1:] slice, then strip the newline and tokenize each remaining line
with open("../Data/College Basketball Dataset.csv") as f:
    data2 = [line.rstrip().split(',') for line in f.readlines()[1:]]

# Note the use of the comprehension and the list slicing here.  Both are very useful.
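
The comprehension and the slice can be tried without the data file. The sketch below uses a small made-up text held in memory with `io.StringIO`, and also shows `itertools.islice` as one alternative that skips lines while iterating, without first building the full list of lines. The sample text and variable names are assumptions for illustration.

```python
import io
from itertools import islice

# Made-up file content: one header line followed by two data lines
text = "name,score\na,1\nb,2\n"

# Pattern used above: read every line, then slice off the first one
lines = io.StringIO(text).readlines()
rows_sliced = [line.rstrip().split(",") for line in lines[1:]]

# Alternative: skip the first line lazily while iterating over the file object
with io.StringIO(text) as f:
    rows_lazy = [line.rstrip().split(",") for line in islice(f, 1, None)]

print(rows_sliced)
print(rows_lazy)
```

Both approaches give the same rows here; the lazy form mainly matters when a file is too large to hold comfortably in memory.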

In [None]:
# show the data
ShowData(data2)

#### Another Python-esque version of the code
This time we use a module to help us read the dataset, and we read all rows.  Note that this version retains the column heading row.

In [None]:
#
# Python-esque version 2 
#
# use the csv module
import csv
data3 = []
with open("../Data/College Basketball Dataset.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        data3.append(row)
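
One reason to prefer the `csv` module over a plain `split(',')` is that it understands quoted fields. The sketch below uses made-up text with a comma inside a quoted field; it makes no claim about whether the basketball file contains such fields.

```python
import csv
import io

# Made-up content: the first field of the data line contains a comma inside quotes
text = 'name,city\n"Smith, J.",Springfield\n'

# Splitting on commas breaks the quoted field into two pieces
naive = [line.rstrip().split(",") for line in io.StringIO(text)]

# csv.reader keeps the quoted field together and removes the quotes
parsed = list(csv.reader(io.StringIO(text)))

print(naive[1])
print(parsed[1])
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=""` so that the reader can handle line endings itself.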


In [None]:
# show the data
ShowData(data3)

### After reading the file, check row and column number in the list of lists (all three versions)
You will find that the three datasets have different numbers of rows because we skipped a different number of lines each time: 5 lines for **data1**, 1 line for **data2**, and none for **data3**. Make sure you understand how each version achieves this.

In [None]:
ShowShape(data1)
ShowShape(data2)
ShowShape(data3)
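
Because `ShowShape` reports the column count of the first row only, a short check across all rows can be useful. The following is a minimal sketch; `column_counts` is a hypothetical helper name, and the sample lists are made up.

```python
def column_counts(dataset):
    """Return the sorted list of distinct row lengths found in a list of lists."""
    return sorted({len(row) for row in dataset})


regular = [["one", "two", "three"], ["four", "five", "six"]]
ragged = [["one", "two", "three"], ["four", "five"]]

print(column_counts(regular))   # a single value means every row has the same length
print(column_counts(ragged))    # more than one value means the rows are uneven
```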

### We can do some simple calculations with the dataset we read
Here, we calculate the mean number of games played by a team during 2015-2020.

In [None]:
total = 0
# iterate from the first row to the last row
for i in data2:
    # add the games-played value (index 2) of every row to the total
    total = total + float(i[2])
mean = total/len(data2)
mean

In [None]:
# Or with a comprehension
sum([float(i[2]) for i in data2])/len(data2)

In [None]:
# Or with a more user-friendly format
print(f"Mean number of games played: {mean:.2f}")
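
As one more option, the standard library's `statistics.fmean` can compute the mean, and an f-string format specifier can control the number of decimals shown. The rows below are made up so the sketch runs on its own; with the real data you would pass `data2` instead.

```python
from statistics import fmean

# Made-up rows shaped like the ones above: the value of interest is at index 2
rows = [["Team A", "Conf X", "30"], ["Team B", "Conf Y", "34"]]

# Convert the strings to floats and average them
mean_games = fmean(float(r[2]) for r in rows)

# ":.1f" formats the value with one digit after the decimal point
print(f"Mean number of games played: {mean_games:.1f}")
```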

### Look at the column headers

In [None]:
# Use the dataset that includes the headers (data3)
data3[0]
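
With the header row available, a column can be located by name instead of by a hard-coded index. This is a sketch on a made-up table; the column labels `TEAM`, `GAMES`, and `WINS` are assumptions for illustration and may not match the labels in the actual file.

```python
# Made-up table: the first inner list is the header row
table = [
    ["TEAM", "GAMES", "WINS"],
    ["Team A", "30", "20"],
    ["Team B", "34", "17"],
]

# Separate the header from the data rows
header, *rows = table

# Find the position of a column by its label
col = header.index("GAMES")

# Pull that column out of every data row as numbers
games = [float(r[col]) for r in rows]
print(col, games)
```

If the label is not present, `header.index` raises a `ValueError`, which is usually a helpful early signal that the file layout differs from what the code expects.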