# Week 1: CSV Data Analysis Review

At the end of the Python bootcamp, you were analysing CSV files.  This is the first step in most data analysis.

In this lesson, we will add a few more tools to your toolkit, including tuples. Then we will briefly introduce pandas, the basic Python library for analysing multi-column datasets.

First...

In order to read a file into the notebook, you need to know the **relative path of the CSV from the current directory**. If you are using MacOS or Linux, you can see what files you have in the current directory by using "ls":

In [None]:
ls

To see which files/folders are one level up in the directory tree:

In [None]:
ls ..

To see which files/folders are inside a specific folder below the current one, for example in data:

In [None]:
ls data

Then find the path to the csv you want to read.
Here the path is: "data/goog.csv" 

In [None]:
with open("data/goog.csv", errors="ignore") as file:
    rowCounter = 0
    for row in file:
        print(row)
        rowCounter += 1  # add one to the counter for each row

In [None]:
print("There are", rowCounter, "rows in this data.")

In [None]:
!head data/goog.csv

What are some things we might want to know about this stock data?
Maybe:
    
* highest value
* lowest value
* largest volume
* biggest difference in a single day (high-low)

You did some of these things in the bootcamp when you looked for the longest and shortest names.  So far we haven't saved the data when reading it in.  We need to assign it to a variable.  I'll make a function that reads the file and returns the rows.

In [None]:
def read_data(filename):
    # Function takes file path, and returns a list of the rows of the data
    data = []
    with open(filename, errors="ignore") as file:
        for row in file:
            data.append(row)
    return data

In [None]:
mydata = read_data("data/goog.csv")

In [None]:
len(mydata)

In [None]:
# Let's get rid of the first row using a slice operation to skip the first item
mydata = mydata[1:]

In [None]:
mydata[0]

In [None]:
len(mydata)
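
A quick way to see what `read_data` gives back is to run it on a tiny file. The sketch below writes a three-line CSV to a temporary file (the contents are made up for illustration, not taken from `goog.csv`) and shows that every row comes back as a plain string that still ends in `\n`.

```python
import os
import tempfile


def read_data(filename):
    # Same function as above: returns a list of the rows of the file
    data = []
    with open(filename, errors="ignore") as file:
        for row in file:
            data.append(row)
    return data


# Made-up sample contents, only for illustration
sample_text = "Date,Open,High\n2020-01-02,10.0,11.5\n2020-01-03,11.0,12.0\n"

# Write the sample to a temporary file so read_data has something to open
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_rows = read_data(sample_path)
os.remove(sample_path)  # clean up the temporary file

header = sample_rows[0]        # the first row holds the column names
sample_rows = sample_rows[1:]  # slice it off, as we did with mydata

print(repr(header))
print(sample_rows)
```

Using `repr` makes the trailing `\n` visible, which matters later when we clean the values.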

Now let's make a function to find the highest value.  The question is which value do we check?  Let's just look at the High value.  But to make this general, let's just pass in the index of the column we want to check.  That way it's easy to check other columns!


In [None]:
def get_highest(data, column):
    highest = 0
    for row in data:
        # we have to split it up by the comma, to use the column index:
        vals = row.split(",")
        print(vals)
        if vals[column] > highest:
            highest = vals[column]
    return highest

In [None]:
get_highest(mydata, 2)

The error says it's a type error.  Remember, we read in text data.  We have to convert it to numbers to compare them!  There are decimals in stock data, so we want to make them floats.
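
Here is a small sketch of why the conversion matters (the values are made up for illustration). Comparing a string with a number raises the `TypeError` seen above. Comparing two strings with each other does run, but it goes character by character, which gives the wrong answer for numbers.

```python
text_value = "9.5"  # the kind of value we get from row.split(",")

# Comparing a string with a number is not allowed in Python 3
got_type_error = False
try:
    text_value > 0
except TypeError as err:
    got_type_error = True
    print("TypeError:", err)

# Two strings are compared character by character, so "9" beats "1"
string_comparison = "9.5" > "10.0"

# After converting to floats, the numbers are compared by value
float_comparison = float("9.5") > float("10.0")

print(string_comparison, float_comparison)
```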

In [None]:
def get_highest(data, column):
    highest = 0
    for row in data:
        vals = row.split(",")
        # convert the text to a number once, then compare
        value = float(vals[column])
        if value > highest:
            highest = value
    return highest

In [None]:
get_highest(mydata, 1)
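
The same pattern covers other questions from the list above. Here is a rough sketch of two more helpers, `get_lowest` and `get_biggest_range` (both names are made up for this example). It assumes the columns are ordered Date, Open, High, Low, Close, Volume, so that High is index 2 and Low is index 3; check this against your own header row. The sample rows are invented so the block runs on its own.

```python
def get_lowest(data, column):
    # Start from infinity so that any real value is lower
    lowest = float("inf")
    for row in data:
        vals = row.split(",")
        value = float(vals[column])
        if value < lowest:
            lowest = value
    return lowest


def get_biggest_range(data, high_column=2, low_column=3):
    # Assumed column positions: High at index 2, Low at index 3
    biggest = 0
    biggest_date = None
    for row in data:
        vals = row.split(",")
        difference = float(vals[high_column]) - float(vals[low_column])
        if difference > biggest:
            biggest = difference
            biggest_date = vals[0]
    # Return the date and the difference together as a tuple
    return biggest_date, biggest


# Invented rows in the assumed Date,Open,High,Low,Close,Volume layout
sample_rows = [
    "2020-01-02,10.0,11.5,9.5,11.0,1000\n",
    "2020-01-03,11.0,14.0,10.5,13.0,\n",
    "2020-01-06,13.0,13.5,12.0,12.5,1500\n",
]

lowest_low = get_lowest(sample_rows, 3)
range_result = get_biggest_range(sample_rows)
print(lowest_low)
print(range_result)
```

Note that `get_biggest_range` returns two things at once, packed into a tuple, so you can write `date, difference = get_biggest_range(sample_rows)` to unpack them.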

## Dictionaries Again

We did dictionaries at the end of the Python Bootcamp.  Let's review a little bit.  If we wanted to store each row in a dictionary using the date as the key, we could do it like this:

In [None]:
def make_dict(data):
    mydict = {}
    for row in data:
        vals = row.split(',')
        mydict[vals[0]] = vals[1:]
    return mydict

In [None]:
dictdata = make_dict(mydata)

In [None]:
dictdata.keys()

In [None]:
dictdata.values()

Notice there is a stray "\n" (newline) character in here. Also, notice that not all of the lists have the same number of elements.  Volume is missing from a bunch of dates. Let's use "strip" to clean up the "\n" and use a couple of list comprehensions to clean the data and convert it to numbers:

In [None]:
def make_dict(data):
    mydict = {}
    for row in data:
        vals = row.split(',')
        # this is a list comprehension that applies strip to remove whitespace chars including
        # \n from each value in the vals list, then reassigns the result to vals, overwriting
        # the previous messy list.
        vals = [val.strip() for val in vals]
        # another list comprehension -- now we turn everything into a floating point number,
        # except for the first value, which is the date string and except for the empty strings.
        # There is an if-test to rule out the '' empty strings!
        floats = [float(val) for val in vals[1:] if val != '']
        # now we save the floats list to the dictionary using the date, the first item, as key
        mydict[vals[0]] = floats
    return mydict
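
To see what each step does, here is the same cleaning applied to a single row. The row is made up, with the volume left empty the way it is for some dates in the data.

```python
# An invented row with no volume after the last comma
row = "2020-01-03,11.0,14.0,10.5,13.0,\n"

# Step 1: split on commas; the last item is just the newline
vals = row.split(",")
print(vals)

# Step 2: strip whitespace from every item; the newline becomes ''
vals = [val.strip() for val in vals]
print(vals)

# Step 3: skip the date, drop empty strings, convert the rest to floats
floats = [float(val) for val in vals[1:] if val != ""]
print(floats)
```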

In [None]:
dictdata = make_dict(mydata)

In [None]:
dictdata.values()

Notice the results look like a list of lists, but the type is dict_values.  That means using indexing to access or slice the values won't work:

In [None]:
dictdata.values()[0]

But we can convert to a list easily and then do it:

In [None]:
list(dictdata.values())[0]
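
Building the full list just to look at one item is fine for small data. If you only want the first value, one alternative is to ask the view for an iterator and take a single item with `next`. This is a sketch on made-up sample data, not a replacement for the cell above.

```python
# Made-up sample data shaped like dictdata
sample = {"2020-01-02": [10.0, 11.5], "2020-01-03": [11.0, 14.0]}

# Indexing the dict_values view directly fails
indexing_failed = False
try:
    sample.values()[0]
except TypeError as err:
    indexing_failed = True
    print("TypeError:", err)

# Taking one item from an iterator works without copying everything
first_value = next(iter(sample.values()))
print(first_value)
```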

How would we find which dates have a volume and which don't? We can check the length of the list stored under each key.

In [None]:
for key, value in dictdata.items():
    if len(value) == 5:
        print("Date {} has volume of {}".format(key, value[4]))
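
The loop above only reports dates that have a volume. To answer the other half of the question, here is a sketch that sorts the dates into two lists. It uses made-up sample data and relies on the same assumption as the loop: a complete row has 5 numbers, and a row with missing volume has fewer.

```python
# Made-up sample data shaped like dictdata
sample = {
    "2020-01-02": [10.0, 11.5, 9.5, 11.0, 1000.0],
    "2020-01-03": [11.0, 14.0, 10.5, 13.0],
    "2020-01-06": [13.0, 13.5, 12.0, 12.5, 1500.0],
}

# Dates whose list has all 5 numbers, and dates that are missing one
with_volume = [date for date, values in sample.items() if len(values) == 5]
without_volume = [date for date, values in sample.items() if len(values) < 5]

print("With volume:", with_volume)
print("Without volume:", without_volume)
```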

Since Python 3.7, regular dictionaries remember the order in which items were inserted (in older versions they did not), but they never sort themselves by key. Our data is inherently ordered by date, so we want to be explicit about that order.  Luckily, there is a special type of dictionary called OrderedDict that is built around retaining order.  It lives in the collections library, just like the useful Counter does.  The documentation is here: https://docs.python.org/3/library/collections.html#collections.OrderedDict

In [None]:
from collections import OrderedDict

In [None]:
# We create one using a sorted version of the dict we made, sorting by the first value in items, 
# which is the date:
ordereddata = OrderedDict(sorted(dictdata.items(), key=lambda x: x[0]))

In [None]:
type(ordereddata)

In [None]:
# Now we can verify that the keys are in order:
ordereddata.items()
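
One caveat about sorting by the date string: strings sort character by character, so this only gives chronological order when the dates are written year first, like `2020-01-03`. If your dates are in another format, one option is to parse them with `datetime.strptime` inside the sort key. The sketch below assumes a month/day/year format and uses made-up data; the actual format in `goog.csv` may differ.

```python
from collections import OrderedDict
from datetime import datetime

# Made-up data with dates in an assumed month/day/year format
sample = {
    "01/02/2020": [10.0],
    "12/31/2019": [9.0],
    "01/03/2020": [11.0],
}

# Sorting by the raw string puts December 2019 last, which is wrong
by_string = OrderedDict(sorted(sample.items(), key=lambda x: x[0]))

# Parsing each key into a real date gives chronological order
by_date = OrderedDict(
    sorted(sample.items(), key=lambda x: datetime.strptime(x[0], "%m/%d/%Y"))
)

print(list(by_string.keys()))
print(list(by_date.keys()))
```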

That means that the data is too... so we can get all the values for the low and high and make charts if we want.  The basic charting tool in Python is Matplotlib, which is not very easy to use.  We will not do complex things with it.  Seaborn is a tool that improves your charting options a bit, and makes things look better.  Older versions of Seaborn restyled your charts as soon as you imported it; since version 0.8 you have to call `seaborn.set()` to switch the style on.  You want to use these lines all the time:

In [None]:
import matplotlib.pyplot as plt
import seaborn
seaborn.set()  # apply Seaborn's default chart style
# the following line means "put my charts inside the notebook, instead of somewhere else."
%matplotlib inline

In [None]:
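# each item is a (date, values) tuple, so val[1] is the list of numbers
# and val[1][1] is the High value
highs = [val[1][1] for val in ordereddata.items()]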

In [None]:
list(ordereddata.items())

In [None]:
highs

In [None]:
plt.plot(highs)
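
Each entry that `items()` hands back is a `(date, values)` tuple, which is why an expression like `val[1][1]` reaches the High value. A tuple can be indexed like a list, or unpacked into separate names, which is often easier to read. Here is a small sketch with made-up data, assuming as above that the High is the second number in each list.

```python
from collections import OrderedDict

# Made-up sample data shaped like ordereddata
sample = OrderedDict([
    ("2020-01-02", [10.0, 11.5, 9.5, 11.0]),
    ("2020-01-03", [11.0, 14.0, 10.5, 13.0]),
])

# Each item is a tuple of (key, value)
first_item = list(sample.items())[0]
print(type(first_item), first_item)

# Indexing into the tuple: item[1] is the list, item[1][1] is the High
highs_by_index = [item[1][1] for item in sample.items()]

# Unpacking the tuple into two names does the same thing
highs_unpacked = [values[1] for date, values in sample.items()]

# Unpacking also makes it easy to collect the dates alongside
dates = [date for date, values in sample.items()]

print(highs_by_index, highs_unpacked, dates)
```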

**Now open the *Writing and Reading Files in Python* notebook to learn more about how to deal with CSV files in Python.**