# Lecture 8

## Reading Data



In [None]:
!ls 

In [None]:
!cat Scores.csv

## Comma Separated Values (CSV) File Format

One of the simplest and most common file formats for storing data is Comma Separated Values (CSV). A CSV file generally represents a table. The top row (the first line of the file) holds the column labels, separated by commas, and each column stores a different feature or field. For student data, for example, the first column could be the name, the second the ID, the third the major, and so on. After the first line, each row holds the data for one data point or example; for student data, each row could correspond to one student.
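
As a small illustration of this layout, the sketch below parses a made-up CSV held in a string. The column names (`name`, `id`, `major`) and the student records are hypothetical, chosen only to mirror the description above; they are not taken from `Scores.csv`.

```python
# Hypothetical CSV content: one header line followed by one line per student.
csv_text = "name,id,major\nAda,1001,Math\nGrace,1002,Physics\n"

lines = csv_text.splitlines()
header = lines[0].split(",")                      # column labels from the first line
rows = [line.split(",") for line in lines[1:]]    # one list of strings per student

print(header)
print(rows)
```

Every value comes back as a string, including the ID column.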

### Reading CSV Files

There are lots of libraries for reading CSV files into memory. Before we start using them, let's write our own. We'll need two things:

* A means of reading and interpreting the file.
* A representation of the read data in memory.

Let's recall one of the ways to read a file in Python.

In [None]:
f=open("Scores.csv","r")

first_line = f.readline()
print(first_line)

line = f.readline()
while line:
    print(line)
    line = f.readline()

f.close()

We successfully dumped the contents of the file, but we didn't:

* interpret it: each line is just a string, not a comma-separated list of keys or values.
* store it in memory: we just printed it to the screen.

Let's start on properly interpreting the first line, which is special:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline()
print(first_line.split(","))
f.close()

It appears that each line ends with `\n`. Here's how we can remove these newlines.

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
print(first_line.split(","))
f.close()

Finally, let's store the first line, which is a list of the column names:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
keys=first_line.split(",")
f.close()

In [None]:
keys

Now let's read the rest of the file in a similar fashion:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
keys=first_line.split(",")

data=list()

line = f.readline().rstrip()
while line:
    data.append(line.split(","))
    line = f.readline().rstrip()

f.close()

In [None]:
data

In [None]:
len(data[3])
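
One caveat with the `while line:` loop above is that it stops at the first empty line, and the file stays open if an error occurs before `f.close()` runs. A possible alternative, sketched here on a small temporary file with made-up contents (the helper name `read_csv_rows` and the demo variable names are hypothetical), uses a `with` block and iterates over the file directly:

```python
import os
import tempfile

def read_csv_rows(filename):
    """Return (keys, rows) for a simple CSV file with no quoted commas."""
    with open(filename, "r") as f:               # closed automatically on exit
        keys = f.readline().rstrip("\n").split(",")
        rows = []
        for line in f:                           # iterate over the remaining lines
            line = line.rstrip("\n")
            if line:                             # skip blank lines instead of stopping
                rows.append(line.split(","))
    return keys, rows

# Made-up file contents, only for demonstration (note the blank line).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("name,l1_1\nAda,10\n\nGrace,9\n")

demo_keys, demo_rows = read_csv_rows(tmp.name)
os.remove(tmp.name)                              # clean up the temporary file

print(demo_keys)
print(demo_rows)
```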

We have everything in memory now, how do we retrieve it?

To associate a specific key with its column number, we can look it up in the `keys` list:

In [None]:
keys.index("l4_1")

So the `l4_1` grade of the student in row 10 (the eleventh student, since indexing starts at 0) is:

In [None]:
data[10][keys.index("l4_1")]

Note that it's still a string, not a number:

In [None]:
type(data[10][keys.index("l4_1")])

## Building a CSV Reader

We have the basics down, but there is more to consider:
* So far we have written example code. We should now write something general that we can reuse in different situations.
* The fields can be of different types: strings or numbers (integer or floating point). We should store each field with the correct data type.
* We still need to decide how to store the data in memory.


We have some options on how to proceed:
   * We could write a CSV reader function that, given the filename of a CSV file, reads the data and returns it as a standard Python data object. Several representations are suitable, so we'll either have to pick one or provide options to allow for others.
   * Instead of a CSV reader function, we could create a CSV reader class. It would be instantiated with a CSV filename, so each instance would be tied to a specific file. It would read the data into some representation that is kept private, and we would provide accessor methods to retrieve specific parts of the data, or the whole data as standard Python data.
   * We can separate the concepts of a CSV reader and how we store the data. In this way, we could write other readers (e.g. Excel file reader) that would still use the same data storage.
   * We might want to also be able to write out CSV files.

Consider the following implementation of a file reader that leaves room for supporting other file formats:

In [None]:
class DataFileHandler:
    def __init__(self,extensions):
        self.__extensions=extensions
        
    def check_extension(self,filename):
        file_extension=filename.split(".")[-1]
        return file_extension in self.__extensions

    def _readfile(self,filename):
        raise NotImplementedError    
        
    def readfile(self,filename,check_extension=True):
        if not check_extension or self.check_extension(filename):
            return self._readfile(filename)        
        else:
            print("Error: filename {} does not match acceptable extensions.".format(filename))
    
    def _writefile(self,filename,data):
        raise NotImplementedError
        
    def writefile(self,filename,data):
        return self._writefile(filename,data)
        
        
class CSVHandler(DataFileHandler):
    def __init__(self):
        #super(CSVHandler,self).__init__(["csv","CSV"])
        DataFileHandler.__init__(self,["csv","CSV"])
        
    def _readfile(self,filename):
        f=open(filename,"r")
        first_line = f.readline().rstrip()
        fields=first_line.split(",")

        data=list()

        line = f.readline().rstrip()
        while line:
            data.append(line.split(","))
            line = f.readline().rstrip()

        f.close()
        
        return fields,data
        
    

In [None]:
my_handler=CSVHandler()
fields,data=my_handler.readfile("Scores.csv")

In [None]:
type(data[0][3])

### Handling Different Types

When we read a text file, all of its content comes in as strings. Ideally, we would instead interpret the file by looking at each field and selecting an appropriate type.

In the following example, we first attempt to convert every field to a `float`, and keep the original string if the conversion fails.

In [None]:
class DataFileHandler:
    def __init__(self,extensions):
        self.__extensions=extensions
        
    def check_extension(self,filename):
        file_extension=filename.split(".")[-1]
        return file_extension in self.__extensions

    def _readfile(self,filename):
        raise NotImplementedError    
        
    def readfile(self,filename,check_extension=True):
        if not check_extension or self.check_extension(filename):
            return self._readfile(filename)        
        else:
            print("Error: filename {} does not match acceptable extensions.".format(filename))
    
    def _writefile(self,filename,data):
        raise NotImplementedError
        
    def writefile(self,filename,data):
        return self._writefile(filename,data)
        
        
class CSVHandler(DataFileHandler):
    def __init__(self):
        #super(CSVHandler,self).__init__(["csv","CSV"])
        DataFileHandler.__init__(self,["csv","CSV"])
        
    def _readfile(self,filename):
        f=open(filename,"r")
        first_line = f.readline().rstrip()
        fields=first_line.split(",")

        data=list()

        line = f.readline().rstrip()
        while line:
            items=line.split(",")
            
            row=list()
            for item in items:
                try:
                    d=float(item)
                except ValueError:
                    d=item
                row.append(d)
            
            data.append(row)
            
            line = f.readline().rstrip()

        f.close()
        
        return fields,data
        
    

In [None]:
my_handler=CSVHandler()
fields,data=my_handler.readfile("Scores.csv")

In [None]:
data[0][3]
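
The handler above turns every numeric field into a `float`, so integer columns lose their integer type. A possible refinement, sketched here as a stand-alone helper with the hypothetical name `convert`, tries `int` first, then `float`, and falls back to the original string:

```python
def convert(item):
    """Try int first, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(item)
        except ValueError:
            continue          # this type did not fit, try the next one
    return item

converted = [convert(s) for s in ["10", "9.5", "Ada", ""]]
print(converted)
print([type(v).__name__ for v in converted])
```

Keep in mind that `float` also accepts strings such as `"nan"` and `"inf"`, so a text column containing those words would be converted to numbers by this approach.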

### Data Representation

Since we are using a list of lists to contain the data of the CSV file in memory, we have to do a bit of manipulation to find specific fields in the list of lists:

In [None]:
data[10][fields.index("l4_1")]

Seems like we should use a dictionary instead... recall some basics:

In [None]:
foo=dict()
foo["L1"]=1
foo["L2"]=2

foo

In [None]:
foo["L1"]

In [None]:
dict([ ("L1",1), ("L2",2) ]  )

So in principle, we can then take every row and convert it easily to a dictionary:

In [None]:
first_row=dict(list(zip(fields,data[0])))

In [None]:
first_row

In [None]:
first_row["l4_1"]

And then store the dictionary for each row in a list:

In [None]:
new_data=list()

for row in data:
    new_data.append(dict(list(zip(fields,row))))

In [None]:
new_data

In [None]:
new_data[10]["l3_4"]
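
For comparison, the standard library's `csv.DictReader` produces the same list-of-dictionaries shape directly. The sketch below reads from a made-up string through `io.StringIO` instead of a file; the column names and values are hypothetical.

```python
import csv
import io

# Made-up CSV text standing in for a file.
sample = "name,l3_4,l4_1\nAda,10,9\nGrace,8,10\n"

reader = csv.DictReader(io.StringIO(sample))
records = list(reader)            # one dictionary per row, keyed by the header

print(records[1]["l4_1"])         # values are still strings
```

Unlike our handler, `csv.DictReader` does no type conversion, so numeric fields still need to be converted after reading.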

### Example Data Class

Our representation of a table as a list of dictionaries isn't the most efficient, since every row repeats the same keys, but it is convenient.

Let's try a different approach to storing the data, one that is custom-made for tables with rows of data.


In [None]:
class DataRow:
    def __init__(self,fields,data):
        self.__fields=fields
        self.__data=data
        
    def __getitem__(self,key):
        return self.__data[self.__fields.index(key)]


class Data:
    def __init__(self):
        self.__fields=list()
        self.__data=list()
        
    def set_fields(self,fields):
        self.__fields=fields
        
    def add_data_point(self,data_point):
        if isinstance(data_point,list):
            if len(data_point) == len(self.__fields):
                self.__data.append(DataRow(self.__fields,data_point))
            else:
                print("Expected {} fields, got {} fields.".format(len(self.__fields),len(data_point)))
        else:
            print("Data Point must be given as a list.")

    def add_data_points(self,data_points):
        for data_point in data_points:
            self.add_data_point(data_point)
            
    def fields(self):
        return self.__fields
    
    def __getitem__(self,key):
        return self.__data[key]

    def __str__(self):
        # __str__ must return a string, not a list
        return str(self.__fields)

In [None]:
my_data=Data()
my_data.set_fields(fields)
my_data.add_data_points(data)

In [None]:
my_data[12]["l3_4"]
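
`DataRow.__getitem__` calls `list.index`, which scans the field list on every lookup. One possible refinement, sketched here with the hypothetical class name `IndexedRow` and made-up field names and values, builds a field-to-position dictionary once and shares it between all rows:

```python
class IndexedRow:
    def __init__(self, positions, values):
        self._positions = positions    # shared dict: field name -> column index
        self._values = values

    def __getitem__(self, key):
        return self._values[self._positions[key]]

field_names = ["name", "l3_4", "l4_1"]                        # hypothetical
positions = {name: i for i, name in enumerate(field_names)}   # built only once

indexed_rows = [IndexedRow(positions, ["Ada", 10.0, 9.0]),
                IndexedRow(positions, ["Grace", 8.0, 10.0])]

print(indexed_rows[1]["l4_1"])
```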

Now let's put it all together:

In [None]:
class DataFileHandler:
    def __init__(self,extensions):
        self.__extensions=extensions
        
    def check_extension(self,filename):
        file_extension=filename.split(".")[-1]
        return file_extension in self.__extensions

    def _readfile(self,filename):
        raise NotImplementedError    
        
    def readfile(self,filename,check_extension=True):
        if not check_extension or self.check_extension(filename):
            return self._readfile(filename)        
        else:
            print("Error: filename {} does not match acceptable extensions.".format(filename))
    
    def _writefile(self,filename,data):
        raise NotImplementedError
        
    def writefile(self,filename,data):
        return self._writefile(filename,data)
        
        
class CSVHandler(DataFileHandler):
    def __init__(self):
        #super(CSVHandler,self).__init__(["csv","CSV"])
        DataFileHandler.__init__(self,["csv","CSV"])
        
    def _readfile(self,filename):
        f=open(filename,"r")
        first_line = f.readline().rstrip()
        fields=first_line.split(",")

        data=list()

        line = f.readline().rstrip()
        while line:
            items=line.split(",")
            
            row=list()
            for item in items:
                try:
                    d=float(item)
                except ValueError:
                    d=item
                row.append(d)
            
            data.append(row)
            
            line = f.readline().rstrip()

        f.close()
        
        my_data=Data()
        my_data.set_fields(fields)
        my_data.add_data_points(data)
        
        return my_data
        
    

In [None]:
my_handler=CSVHandler()
my_data=my_handler.readfile("Scores.csv")

In [None]:
my_data[10]["l4_1"]

## Pandas

What we just built is very similar to Pandas...

In [None]:
import pandas as pd
# Use a new name so the DataFrame does not shadow our Data class
df=pd.read_csv("Scores.csv")

In [None]:
type(df)

In [None]:
df

In [None]:
dir(df)

In [None]:
df[df["l4_2"]==10]

In [None]:
df.columns

## Indexing

We implemented `__getitem__` in our data classes to enable easy access to our data. Let's take a closer look:

In [None]:
class my_list:
    def __init__(self,a_list):
        self._list=a_list
        
    def __getitem__(self,key):
        print(key)
        

We get a lot of nice functionality... but not everything:

In [None]:
obj = my_list([5,5,5])

obj[1]
obj[1,2]
obj[1,2,3]
obj[1:2]

Note that slicing passes a `slice` object to `__getitem__`, not a slice of the data. Let's take a closer look at `slice`:

In [None]:
slice(1,2,3)

In [None]:
slice(1,2,3).start

In [None]:
obj[:1]
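
Slice objects also provide an `indices(length)` method, which fills in missing values and resolves negative positions for a sequence of the given length. A short sketch with an arbitrary example list:

```python
values = [10, 20, 30, 40, 50]

s = slice(None, -1, 2)                        # what values[:-1:2] passes to __getitem__
start, stop, step = s.indices(len(values))    # normalized for a list of length 5

print(start, stop, step)
picked = [values[i] for i in range(start, stop, step)]
print(picked)
```

The normalized triple can be passed straight to `range`, which avoids handling `None` and negative values by hand.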

We can detect in `__getitem__` when we get a slice object and use it accordingly:

In [None]:
class my_list:
  def __init__(self,a_list):
    self._list=a_list

  def __getitem__(self, key):
    if isinstance(key, slice):
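        # slice.indices fills in defaults and resolves negative values;
        # "key.stop or len(...)" would wrongly treat a stop of 0 as missing
        start, stop, step = key.indices(len(self._list))
        return [self._list[i] for i in range(start, stop, step)]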
    elif isinstance(key, int):
        return self._list[key]
    elif isinstance(key, tuple):
        raise NotImplementedError
    else:
        raise TypeError 

In [None]:
my_list([1,2,3,4,5,6,7])[2:6:1]

What if we want to do something more complicated, like handle both `M[i][j]` and `M[i,j]` in the same way?

In [None]:
class DataRow:
    def __init__(self,fields,data):
        self.__fields=fields
        self.__data=data
        
    def __getitem__(self,key):
        return self.__data[self.__fields.index(key)]


class Data:
    def __init__(self):
        self.__fields=list()
        self.__data=list()
        
    def set_fields(self,fields):
        self.__fields=fields
        
    def add_data_point(self,data_point):
        if isinstance(data_point,list):
            if len(data_point) == len(self.__fields):
                self.__data.append(DataRow(self.__fields,data_point))
            else:
                print("Expected {} fields, got {} fields.".format(len(self.__fields),len(data_point)))
        else:
            print("Data Point must be given as a list.")

    def add_data_points(self,data_points):
        for data_point in data_points:
            self.add_data_point(data_point)
            
    def fields(self):
        return self.__fields
    
    def __getitem__(self,key):
        return self.__data[key]

    def __str__(self):
        # __str__ must return a string, not a list
        return str(self.__fields)
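
The `Data` class above still only supports `M[i][j]`. A minimal sketch of one way to accept `M[i,j]` as well is shown below, using a hypothetical stand-alone class `Table` with made-up contents rather than the classes above. The idea is that `M[i,j]` arrives in `__getitem__` as the tuple `(i, j)`, which can be unpacked and forwarded:

```python
class Table:
    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, key):
        if isinstance(key, tuple):
            i, j = key                  # M[i, j] passes the tuple (i, j)
            return self._rows[i][j]     # same element as M[i][j]
        return self._rows[key]          # M[i] returns the whole row

M = Table([[1, 2, 3], [4, 5, 6]])
print(M[1][2], M[1, 2])
```

The same branch could be added to `Data.__getitem__`, where the second index would be a field name handled by `DataRow`.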

## Lazy Evaluation

Matrix multiplication can be a time-consuming operation. What if we only need a few elements of the result? Can we somehow compute only the elements we need? This is where lazy evaluation comes in handy.

Recall the matrix multiplication formula:

 $C=A \cdot B$: $C_{ij} = \sum_{k} A_{ik} B_{kj}$.
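
To make the formula concrete, for $2\times 2$ matrices the sum over $k$ has two terms, and each entry of $C$ can be written out on its own:

$$
\begin{aligned}
C_{11} &= A_{11}B_{11} + A_{12}B_{21}, & C_{12} &= A_{11}B_{12} + A_{12}B_{22},\\
C_{21} &= A_{21}B_{11} + A_{22}B_{21}, & C_{22} &= A_{21}B_{12} + A_{22}B_{22}.
\end{aligned}
$$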
 
Note that every element of the resultant matrix is computed independently, but a typical implementation of multiplication loops over all elements of the resultant matrix:

In [None]:
def zero_matrix(m,n):
    # m rows, each with n columns
    return [ [0 for _ in range(n)] for _ in range(m)]

def is_matrix(M):
    # A matrix is a non-empty list of rows that all have the same length
    if not isinstance(M,list) or len(M)==0:
        return False
    row_length=len(M[0])
    for row in M:
        if not row_length==len(row):
            return False
    return True
        

def matrix_shape(M):
    if is_matrix(M):
        m=len(M)
        n=len(M[0])
        return m,n
    else:
        return 0,0

def matrix_multiply(M1,M2):
    m1,n1=matrix_shape(M1)
    m2,n2=matrix_shape(M2)
    
    if n1==m2:
        
        M3=zero_matrix(m1,n2)
        
        # Loop over ALL elements of the resultant matrix
        for i in range(m1):
            for j in range(n2):
                for k in range(n1):
                    M3[i][j]+=M1[i][k]*M2[k][j]
        return M3
    
    return False

In [None]:
M1 =  [ [ 1, 2 ] , [ 2, 3 ] ]
M2 =  [ [ 1, 2 ] , [ 2, 3 ] ]

matrix_multiply(M1,M2)
    

Instead we can create a new matrix class that holds the results of a product of two matrices... and only computes the elements it needs.

In [None]:
class lazy_multiplied_matrix:
    def __init__(self,M1,M2):
        m1,n1=matrix_shape(M1)
        m2,n2=matrix_shape(M2)
        
        assert(n1==m2)
        
        self._n1=n1
        
        self._m=m1
        self._n=n2
        
        self._M1=M1
        self._M2=M2

        # By default the resultant matrix will be composed of Nones, 
        # indicating that no element is computed.
        
        # m rows, each with n columns
        self._M3= [ [None for _ in range(self._n)] for _ in range(self._m)]
        
    def __getitem__(self,key):
        if isinstance(key,tuple):
            i,j=key
        else:
            return None

        # Compare against None so that a computed value of 0 still counts as cached
        if self._M3[i][j] is not None:
            return self._M3[i][j]
        else:
            self._M3[i][j]=0.

            for k in range(self._n1):
                self._M3[i][j]+=self._M1[i][k]*self._M2[k][j]
                
            return self._M3[i][j]
        
    def __str__(self):
        return str(self._M3)
    
    __repr__ = __str__
    
                    

In [None]:
M3=lazy_multiplied_matrix(M1,M2)

In [None]:
M3

In [None]:
M3[1,1]

In [None]:
M3
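
An alternative sketch of the same idea stores only the requested entries in a dictionary keyed by `(i, j)`, so no grid of placeholders is allocated up front. The class name `LazyProduct` and its `computed` method are hypothetical, introduced only for this illustration:

```python
class LazyProduct:
    """Entries of A*B are computed on first access and cached in a dict."""
    def __init__(self, A, B):
        if len(A[0]) != len(B):
            raise ValueError("inner dimensions do not match")
        self._A, self._B = A, B
        self._cache = {}                  # (i, j) -> computed entry

    def __getitem__(self, key):
        i, j = key
        if key not in self._cache:        # works even when the entry equals 0
            self._cache[key] = sum(self._A[i][k] * self._B[k][j]
                                   for k in range(len(self._B)))
        return self._cache[key]

    def computed(self):
        """Return a copy of the entries computed so far."""
        return dict(self._cache)

P = LazyProduct([[1, 2], [2, 3]], [[1, 2], [2, 3]])
print(P[1, 1])
print(P.computed())
```

Only the single requested entry appears in the cache; the other three entries of the product are never computed.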