# First steps with data

Now that we've got the basics of Python down, let's start working with some real data. We'll cover loading the data and dealing with the issues it might have, storing each value as its correct type, doing some initial analysis, and finally writing some results out.

For this lesson, I've put together some sample data on movies, stored in *MovieData.csv*. It has a good variety of data types: text (movie names and distributors), numbers (budget, revenue) and dates (release date). In the rest of this lesson, we'll walk through all the steps of analyzing this data, solving common problems as we encounter them. This is a pretty typical workflow: almost always, when working with new data, just loading it into the format that you want will be the first challenge you face.

In the process, we'll cover a few important Python concepts and techniques:

* Working with strings
* Opening and reading text files
* Working with date-times
* List comprehension
* Exception-handling

## Working with Strings

Strings, or text, are a particularly important Python data type. Plain text is nearly universal across computer systems, making it a common way of storing and communicating data. Many file formats (for example CSV, which stands for comma-separated values, and even HTML) are really just structured text.

Not surprisingly, Python has a variety of built-in methods for working with string variables.

For example, we can convert a string into all-lower or all-upper case:

In [1]:
my_text = "CamelCase"
print(my_text.lower())
print(my_text.upper())
print(my_text)

camelcase
CAMELCASE
CamelCase


Notice that these methods return a new, modified string, but don't change the original. To do that, we have to reassign the new value to the variable, for example:

In [2]:
my_text = "CamelCase"
my_text = my_text.lower()
print(my_text)

camelcase


We can also remove unwanted characters from a string using the *.strip()* method. By default, *.strip()* only removes whitespace:

In [3]:
my_text = "  String with WhiteSpace  "
print(my_text)
print(my_text.strip())

  String with WhiteSpace  
String with WhiteSpace


If you want to remove specific characters (for example, punctuation marks) you can specify them in another string that *strip* takes as an argument, for example:

In [4]:
my_text = "    What?!?!?!"
print(my_text.strip(".,?!"))

    What


Notice that now *strip* didn't remove the whitespace. To Python, an empty space, " ", is a character like any other. If you want to include it in the list of characters to remove, you need to add it specifically:

In [5]:
my_text = "    What?!?!?! "
print(my_text.strip(".,?! ")) #  See the added whitespace at the end here

What
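One detail worth keeping in mind: *strip* only trims characters from the two ends of a string, never from the middle. The short sketch below (the sample string is made up for illustration) shows this, along with the related *lstrip* and *rstrip* methods, which trim only the left or the right end.

```python
# A made-up sample string, for illustration only
messy = "?!  What?! Really?!  "

# strip() trims from both ends until it hits a character not in the set
both_ends = messy.strip("?! ")
# lstrip() and rstrip() trim only the left or the right end
left_only = messy.lstrip("?! ")
right_only = messy.rstrip("?! ")

# repr() shows the quotes, so any leftover spaces are visible
print(repr(both_ends))
print(repr(left_only))
print(repr(right_only))
```

The "?!" in the middle of the string survives in all three results, because it is never at an end.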


Internally, strings can behave like lists. For example, we can access a particular character in a string using square brackets, and get a sub-string by adding a colon:

In [6]:
my_text = "Testing"
print("the first character is:", my_text[0])
print("the last three characters are:", my_text[-3:])

the first character is: T
the last three characters are: ing


We can also get the length of a string using the *len* function:

In [7]:
print("The length of the word", my_text, "is:", len(my_text))

The length of the word Testing is: 7


One particularly important string method is *.split()*, which turns a string into an actual list, dividing it into elements wherever a certain separator appears. By default, the separator is any run of whitespace (spaces, tabs or line breaks).

In [8]:
test_string = "A B C D"
split_string = test_string.split()
print(split_string)

['A', 'B', 'C', 'D']


Notice that each element in the list is itself a string variable too.

These elements can be more than one character long. If your string contains a sentence, for example, splitting it can give you individual words:

In [9]:
my_sent = "The quick brown fox jumped over the lazy dog."
print(my_sent.split())

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']


You can also specify a character to split on -- for example, a comma. This is especially important for working with data files, where columns are frequently comma-separated.

In [10]:
test_string = "A,B,C,D"
split_string = test_string.split(",") # Split on commas
print(split_string)

['A', 'B', 'C', 'D']


Notice that when you split on a specific character, those characters are left out of the list -- they are just treated as separators.
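There is a subtle difference between calling *split* with no argument and calling it with an explicit separator, and it matters for data files. The sketch below uses short made-up strings, plus a shortened tab-separated line of the kind we'll meet in the movie data, to show how empty cells are handled.

```python
# split() with no argument treats any run of whitespace as one separator
print("A  B   C".split())

# split(" ") splits on every single space, so a doubled space leaves an empty string
print("A  B".split(" "))

# Splitting on tabs keeps empty cells, which preserves the column positions
cells = "03/09/12\tJohn Carter\t\t300000000".split("\t")
print(cells)
```

Keeping the empty string in the third position is what we want for tabular data: a blank cell shouldn't shift the remaining columns to the left.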

## Loading data

Now that we have an idea of how to work with strings, let's see how to use that to load some input data. A rule of thumb is that just loading data into the format that you want is going to be harder than expected. Here I'm going to walk through loading a typical data file step by step, dealing with all the issues as they come up.

The first thing we need to do is open the file with the data, like this:

In [11]:
f = open("MovieData.csv")

**f** is now storing our connection to the file. Python assumes that we've opened a text file, and knows that text files are divided into lines. CSV stands for *comma-separated values*, but the extension is commonly used for tabular data stored in text files regardless of whether the columns are separated by commas, tabs, or some other character.

Let's look at what the contents of the file look like by printing the first few rows:

In [12]:
for i in range(5):
    print(f.readline())

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross

03/09/12	John Carter		300000000	66439100	254439100

05/25/07	Pirates of the Caribbean: At World's End	Buena Vista	300000000	309420425	960996492

12/13/13	The Hobbit: There and Back Again	New Line	270000000	Unknown	Unknown

12/14/12	The Hobbit: An Unexpected Journey	New Line	270000000	Unknown	Unknown



There are a few things going on here. First, notice the table structure of the data: the first row consists of column headers, and the next rows are data. The columns are separated by whitespace (actually tabs, as we'll see soon). Don't forget about the column headers -- you don't want to confuse them with the data (for example, by including 'Movie' in your list of movie titles).

Now look at the code. The file object **f** internally keeps track of what line it's on. Every time you call readline(), which is a built-in method of the file object, it reads in the next line and then moves down a line. So if you call it again:

In [13]:
print(f.readline())

11/24/10	Tangled	Buena Vista	260000000	200821936	586581936



it'll go down to the next line, until it gets to the end of the file. If you want to 'reset' f and return to the top of the file, use **.seek(0)**, which sets the file reader back to the beginning, like this:

In [14]:
f.seek(0)
print(f.readline())

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



Notice the spacing between the blocks of text -- this suggests that the data is tab-delimited. To check, let's view one line 'raw', without being formatted by the print function:

In [15]:
f.readline()

'03/09/12\tJohn Carter\t\t300000000\t66439100\t254439100\n'

See the '\t' that appears where the gaps were before? That's the code for the tab character. Most of the time, columns in csv files are separated by commas (csv stands for "comma-separated values" after all), but not always. In this case, tabs were probably a good choice because some movie titles might have commas in the title. 
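Outside a notebook, or in the middle of a cell, the same raw view is available through the built-in *repr* function. A minimal sketch, using a hand-typed copy of the line above rather than the file itself:

```python
# A copy of the first data line, typed in so the example runs without the file
line = "03/09/12\tJohn Carter\t\t300000000\t66439100\t254439100\n"

print(line)        # Tabs and the line break are rendered as whitespace
print(repr(line))  # Tabs show up as \t and the line break as \n

raw_view = repr(line)
```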

We don't want to just print rows one at a time -- we want to store them in a variable, where we can work with them. So the first thing we want to do is read the rows in and put them in a list, like this:

In [16]:
rows = [] # List to store rows:
f = open("MovieData.csv")
for row in f:
    rows.append(row)

Now **rows** is a list of rows. We can check what it's storing:

In [17]:
print(rows[0]) # Column names

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



In [18]:
print(rows[133]) # Some arbitrary number

11/17/00	How the Grinch Stole Christmas	Universal	123000000	260044825	345141403



Right now **rows** is just a list of strings, with each row just one long piece of raw text -- not data we can work with yet. 

For our data, we want to split each row on the tab character, or '\t'. Then we add the split row to our list of rows:

In [19]:
rows = []
f = open("MovieData.csv")
for raw_row in f:
    row = raw_row.split("\t")
    rows.append(row)

Now **rows** is a list of lists; each item in the list is a row, and each row is a list of cells.

In [20]:
print(rows[0])
print(rows[1])

['Release_Date', 'Movie', 'Distributor', 'Budget', 'US Gross', 'Worldwide Gross\n']
['03/09/12', 'John Carter', '', '300000000', '66439100', '254439100\n']
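Notice the '\n' still attached to the last cell of each row: it is the line break that ended the line in the file. It does no harm to the numeric columns later on, because int() ignores surrounding whitespace, but it would get in the way if the last column were text. One way to drop it, sketched here on a hand-typed copy of the line rather than the file, is to remove the line break before splitting:

```python
raw_row = "03/09/12\tJohn Carter\t\t300000000\t66439100\t254439100\n"

# Remove only the trailing line break, then split on tabs
row = raw_row.rstrip("\n").split("\t")
print(row)
```

The sketch uses rstrip("\n") rather than a plain strip() on purpose: strip() with no argument would also remove trailing tabs, and with them any empty cells at the end of a row.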


In [21]:
print(rows[1][0]) # The first element of the second row

03/09/12


In [22]:
# We can also count the number of rows:
print(len(rows))

3628


Notice that even the numbers above have quotation marks around them, meaning that they're being read in as text. On its own, Python doesn't know the difference between 'John Carter' and '66439100' -- both are just raw text. We want to convert the numeric columns to numbers, using *casting*, which we touched on last week. We do it like this:

In [23]:
f = open("MovieData.csv")
rows = []
for raw_row in f:
    row = raw_row.split("\t")
    row[3] = int(row[3]) # Budget
    row[4] = int(row[4]) # US Gross
    row[5] = int(row[5]) # Worldwide Gross
    rows.append(row)

ValueError: invalid literal for int() with base 10: 'Budget'

Remember the dangers of mixing the header with the content? That's what's happening here. Notice that Python shows us exactly which line of code the problem is on (where the error is being thrown), and what the error is. The error *invalid literal for int()* means that we're trying to convert something invalid to an integer; Python helpfully tells us what the text is that isn't converting -- the string *Budget*, which we know is the heading in the first row. So we need to skip it.

In [24]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[3] = int(row[3]) # Budget
    row[4] = int(row[4]) # US Gross
    row[5] = int(row[5]) # Worldwide Gross
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



ValueError: invalid literal for int() with base 10: 'Unknown'

We're still getting a similar error, except now the string that isn't converting is the word *Unknown*. It looks like when there's missing data, it isn't just left blank; instead, the table has the word "Unknown" in that space. That's pretty common; often missing records in a dataset will be filled with "Unknown" or "NA" or something similar. 

So how to get around this problem? There are a few things we can do -- we can use an **if** statement to check whether each record is equal to "Unknown" or not; or, we can catch the error using the **try: ... except:** statements.

Here's how try-except works: you specify a block of code for the computer to try and execute; if it encounters an error, instead of failing (like it did above) it goes to a block of code you've provided in case of an error. This is called Error Catching, and it looks like this:

In [25]:
# Without error catching.
text = "Not a number"
a = int(text)
print(a)

ValueError: invalid literal for int() with base 10: 'Not a number'

In [26]:
# With error catching
text = "Not a number"
try:
    # In this example, we'll never get here:
    a = int(text)
    print("The value of text is: ", a)
except:
    # Instead, we'll go here.
    a = 0
    print("The value of text cannot be converted to an integer!")
    

The value of text cannot be converted to an integer!


In [27]:
text = "100"
try:
    # This time, we'll get here...
    a = int(text)
    print("The value of text is: ", a)
except:
    # ...but not here
    a = 0
    print("The value of text cannot be converted to an integer!")
    

The value of text is:  100
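A bare **except:** catches every kind of error, including ones we didn't anticipate, such as a typo in a variable name. A common refinement, shown here as a sketch rather than as the approach this lesson uses, is to name the error type we expect. The helper name *describe* is made up for this example.

```python
def describe(text):
    try:
        number = int(text)
    except ValueError:
        # Only failed conversions end up here; other kinds of error still surface
        return "not an integer"
    return "the integer " + str(number)

print(describe("100"))
print(describe("Unknown"))
```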


Now let's implement this as we read in the movie dataset. To be on the safe side, let's wrap both of the gross conversions in **try-except**, since they will probably have the same problem. If a string can't be converted to a number, we'll replace it with **None**, a special value meaning (you guessed it) a missing value.

In [28]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[3] = int(row[3]) # Budget
    try:
        row[4] = int(row[4]) # US Gross
    except:
        row[4] = None
    try:
        row[5] = int(row[5]) # Worldwide Gross
    except:
        row[5] = None
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



Check one of the troublesome rows:

In [29]:
rows[2]

['12/13/13',
 'The Hobbit: There and Back Again',
 'New Line',
 270000000,
 None,
 None]

Okay, looking good! However, putting try-except pairs around each line is cumbersome. In data analysis, good code and lazy code often go together, and the solution that requires the least repetitive typing is probably the more elegant one too. Notice how we're essentially repeating the same operation several times: converting to int when possible, otherwise assigning None. So let's wrap these all up in a function, and use it to do the conversion.

In [30]:
def convert_to_int(text):
    try:
        return int(text)
    except:
        return None

In [31]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[3] = convert_to_int(row[3]) # Budget
    row[4] = convert_to_int(row[4]) # US Gross
    row[5] = convert_to_int(row[5]) # Worldwide Gross
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



In [32]:
rows[1000]

['03/01/91', 'The Doors', 'Sony', 40000000, 34416893, None]

Better! Same thing, but now the code is more elegant.

Notice that the code above has three lines that basically look the same, except for a single number? That's not a big deal here, but what if you wanted to convert 10 columns, or 100? So we can make the code even more compact by replacing the repeated lines with a loop:

In [33]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    for i in range(3, 6): # Loop from 3 to 5
        row[i] = convert_to_int(row[i])
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



In [34]:
rows[1000]

['03/01/91', 'The Doors', 'Sony', 40000000, 34416893, None]
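To experiment with this loop without the data file at hand, the same logic can be run against a few lines held in memory. The sketch below assumes the sample lines printed earlier in this lesson and uses the standard library's io.StringIO, which behaves like an open text file. A version of convert_to_int is included so the example is self-contained.

```python
import io

def convert_to_int(text):
    try:
        return int(text)
    except ValueError:  # Narrower than the bare except used in the lesson
        return None

# Three lines copied from the top of the data, standing in for the real file
sample = io.StringIO(
    "Release_Date\tMovie\tDistributor\tBudget\tUS Gross\tWorldwide Gross\n"
    "03/09/12\tJohn Carter\t\t300000000\t66439100\t254439100\n"
    "12/13/13\tThe Hobbit: There and Back Again\tNew Line\t270000000\tUnknown\tUnknown\n"
)

sample_rows = []
header = sample.readline()  # Read past the column names
for raw_row in sample:
    row = raw_row.split("\t")
    for i in range(3, 6):  # Budget, US Gross, Worldwide Gross
        row[i] = convert_to_int(row[i])
    sample_rows.append(row)

print(sample_rows[0])
print(sample_rows[1])
```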

## Working with date-times

Notice how the date is still just a text string, though, meaning we can't make meaningful use of it for analysis. Obviously, we don't want to convert it to a number, since dates are a distinct type of data. Python offers a useful **datetime** object which stores (as the name implies) a representation of a date and time.

A bit confusingly, the **datetime** object is inside a **datetime** module, which contains other functions too. Let's import the module, and give it an alias to make it easier to access:

In [35]:
import datetime as dt

Let's start by creating a datetime object that represents the time it was created at:

In [36]:
dt.datetime.now()

datetime.datetime(2017, 5, 1, 22, 18, 13, 675042)

We can store this object in a variable, and then work with it. For example:

In [37]:
now = dt.datetime.now()
print(now)

2017-05-01 22:18:14.811658


In [38]:
print(now.year)
print(now.month)

2017
5


In [39]:
# The date's components are just numbers, so we can compare them to other numbers:
if now.year > 2000: 
    print("It's the future!")

It's the future!


In [40]:
# We can compare datetimes to each other directly too.

later = dt.datetime.now() # The new 'now'
if later > now: 
    print("Time has passed!")

Time has passed!


We can create a datetime object by specifying the exact date and time we want it to represent, like this:

In [41]:
some_date_and_time = dt.datetime(year=2005, month=1, day=1, hour=12, minute=17, second=2)
print(some_date_and_time)

2005-01-01 12:17:02


Or we can only specify a date, without a time:

In [42]:
some_date = dt.datetime(year=2010, month=1, day=1) # Happy new year!
print(some_date)

2010-01-01 00:00:00


However, we can't leave anything else blank; we need at least a year, month and day.

In [43]:
some_month = dt.datetime(year=2003, month=2)

TypeError: Required argument 'day' (pos 3) not found

We don't even need to specifically say 'year=...'; the datetime object will assume that numbers are being inputted in the order year, month, day, hour, minute, second...  It'll also throw an error if it gets a nonsensical date:

In [44]:
dt.datetime(1941, 12, 7) # This date is valid

datetime.datetime(1941, 12, 7, 0, 0)

In [45]:
# This date is not valid
dt.datetime(1931, 5, 35) #  35th of May.

ValueError: day is out of range for month

So how do we convert the date strings in our data to datetime objects? First, we have to split them, the same way as we split the rows. Let's remind ourselves what the strings look like:

In [46]:
print(rows[1000])

['03/01/91', 'The Doors', 'Sony', 40000000, 34416893, None]


Just like we split the rows into cells on the '\t' tab character, we can split the date on the '/' character, like this:

In [47]:
test_date = "03/01/91"
print(test_date.split("/"))

['03', '01', '91']


When we do a split where we know how many values we expect, we can immediately assign each to a variable, like this:

In [48]:
month, day, year = test_date.split("/") # Assign three variables at once
print(year)
print(month) 
print(day)

91
03
01


In a case like this, where we know exactly how many characters each element consists of, we can also treat the string like a list of characters:

In [49]:
print(test_date[:2]) # First two characters, 0 and 1
print(test_date[2]) # The third character, the '/'
print(test_date[3:5]) # Characters 3 and 4
print(test_date[6:]) # Character 6 to the end

03
/
01
91


Of course, these are still strings. We need to convert them to integers, just like we did above:

In [50]:
month, day, year = test_date.split("/") # Assign three variables at once
# Convert strings to integers
month = int(month)
day = int(day)
year = int(year)

print(year, month, day)

91 3 1


Note that the datetime object requires a four-digit year, while our data only has two-digit years. If we use only two digits, Python will assume we mean a date in the first century.

In [51]:
test_date = dt.datetime(year, month, day)
print(test_date.year)
if test_date.year > 1900:
    print("Date in the near past")
else:
    print("Date in the far past")

91
Date in the far past


The solution is to manually add the century to the date:

In [52]:
year = year + 1900
test_date = dt.datetime(year, month, day)
print(test_date.year)
if test_date.year > 1990:
    print("Date in the near past")
else:
    print("Date in the far past")

1991
Date in the near past


However, a quick examination shows us that the data has dates on both sides of the millennium:

In [53]:
print(rows[2])

['12/13/13', 'The Hobbit: There and Back Again', 'New Line', 270000000, None, None]


In [54]:
print(rows[1000])

['03/01/91', 'The Doors', 'Sony', 40000000, 34416893, None]


So we need a cutoff to figure out what to add to the year. Let's start with 50, and then check to see whether that works. Let's also wrap all the conversion in a function, to make the code more readable:

In [55]:
def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object.
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
    if y > 50:
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)

In [56]:
# Testing:
print(make_date(rows[1000][0]))
print(make_date(rows[2][0]))

1991-03-01 00:00:00
2013-12-13 00:00:00
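For comparison, the standard library can parse this format directly with *datetime.strptime*, where the format code %y stands for a two-digit year. This is offered as an alternative sketch, not as the approach this lesson follows: strptime applies its own fixed rule for the century (69 to 99 become 1969 to 1999, and 00 to 68 become 2000 to 2068), which may or may not suit a given dataset.

```python
import datetime as dt

# %m = month, %d = day, %y = two-digit year
parsed_doors = dt.datetime.strptime("03/01/91", "%m/%d/%y")
parsed_hobbit = dt.datetime.strptime("12/13/13", "%m/%d/%y")

print(parsed_doors)
print(parsed_hobbit)
```

Writing the conversion by hand, as make_date does, keeps the choice of cutoff in our own hands.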


Now that we've tested the function, let's integrate it into the data reading. The code below should start to look pretty familiar by this point:

In [57]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[0] = make_date(row[0]) # Convert the date string to the datetime object
    row[3] = convert_to_int(row[3]) # Budget
    row[4] = convert_to_int(row[4]) # US Gross
    row[5] = convert_to_int(row[5]) # Worldwide Gross
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



In [58]:
# Test on some arbitrary rows.
print(rows[1000])
print(rows[2022])

[datetime.datetime(1991, 3, 1, 0, 0), 'The Doors', 'Sony', 40000000, 34416893, None]
[datetime.datetime(1989, 3, 1, 0, 0), 'New York Stories', 'Buena Vista', 15000000, 10763469, None]


## List comprehension

So we actually have all the records in the format we want them; now, how do we do analysis on them? 

By and large, we want to do analysis column-wise: for example, finding the average film budget or gross -- or initially in this case, finding the range of dates. To do this here, we need to create new lists holding only the values we want from each row -- think of it as selecting columns.

We can do this in a for loop, by creating a new list and then iterating over all the rows, and adding only the value we're interested in to the list, like this:

In [59]:
all_dates = []
for row in rows:
    all_dates.append(row[0]) 
print(len(all_dates)) # Number of records
print(all_dates[0]) # First record

3627
2012-03-09 00:00:00


This sort of thing comes up often enough that Python provides a specific idiom for creating lists from other lists, called *list comprehension*. It puts the for loop inside of the square brackets. It works like this:

In [60]:
all_dates = [row[0] for row in rows] # List comprehension

print(len(all_dates)) # Number of records
print(all_dates[0]) # First record

3627
2012-03-09 00:00:00


As you can see, this produces results that are identical to the for loop method above in more compact code. 
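A list comprehension can also filter as it goes, by adding an **if** at the end. That is handy for columns with missing values. The sketch below uses three hand-typed rows in the same layout as our data (with the date left as text for brevity) and keeps only the US Gross values that aren't None.

```python
sample_rows = [
    ["03/09/12", "John Carter", "", 300000000, 66439100, 254439100],
    ["12/13/13", "The Hobbit: There and Back Again", "New Line", 270000000, None, None],
    ["03/01/91", "The Doors", "Sony", 40000000, 34416893, None],
]

# Keep column 4 (US Gross), skipping rows where it is missing
known_us_gross = [row[4] for row in sample_rows if row[4] is not None]
print(known_us_gross)
```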

Now that we have our list of dates, we can quickly use built-in functions min and max to find the start and stop points of the data.

In [61]:
print("Earliest date: ", min(all_dates))
print("Latest date: ", max(all_dates))

Earliest date:  1951-02-23 00:00:00
Latest date:  2050-05-17 00:00:00


Uh-oh, it looks like there are pre-1950s movies in the dataset, so splitting on 1950 won't work.

Let's go for something more sure: if the two-digit year is greater than 13, we'll assume it was a 20th-century year. If it's 13 or below, it's a 21st-century year. Let's revise the code accordingly.

In [63]:
def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
    if y > 13:
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)

In [64]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print(first_row)
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[0] = make_date(row[0])
    for i in range(3, 6):
        row[i] = convert_to_int(row[i])
    rows.append(row)

Release_Date	Movie	Distributor	Budget	US Gross	Worldwide Gross



Now let's just repeat the list comprehension to get the new dates:

In [65]:
all_dates = [row[0] for row in rows]
print("Earliest date: ", min(all_dates))
print("Latest date: ", max(all_dates))

Earliest date:  1915-02-08 00:00:00
Latest date:  2013-12-13 00:00:00


That 2013-12-13 still looks suspect, since the file was created earlier than that. Are we sure that it should be 2013 and not 1913? Let's take a closer look at the row it appears in -- but first we need to find it. We can do that by iterating over all the rows:

In [66]:
target_date = max(all_dates)
for row in rows:
    if row[0] == target_date:
        print(row)

[datetime.datetime(2013, 12, 13, 0, 0), 'The Hobbit: There and Back Again', 'New Line', 270000000, None, None]


We know that The Hobbit couldn't have been released in 1913, since the book it's based on was still many years in the future. In fact, however, a quick Google search will tell us that 'The Hobbit: There and Back Again' was the original name for the second of two planned "Hobbit" movies, before they were made into a trilogy. This row seems to be an earlier estimate, rather than an actual released movie. That would also explain why there is no Gross data.

This is a reminder that parsing the data correctly doesn't ensure that the data is a correct reflection of the world. If something looks wrong or weird in the data, it's worth following up on and examining closely. 

For completeness's sake, let's check and see what the earliest movie is in the dataset -- to make sure we aren't recording the future release date (for example) of a movie slated for 2015:

In [67]:
target_date = min(all_dates)
for row in rows:
    if row[0] == target_date:
        print(row)

[datetime.datetime(1915, 2, 8, 0, 0), 'The Birth of a Nation', '', 110000, 10000000, 11000000]


This looks correct; a quick Google search will confirm that 'Birth of a Nation' was indeed released in 1915.

Now that we've roughly verified the dates, we can go on to analyze some of the other fields, such as the budgets:

In [68]:
all_budgets = [row[3] for row in rows]
print(len(all_budgets))

3627


If we have all the budgets, we can analyze them like we discussed last week. For example, let's recreate the function that computes the mean of a list:

In [69]:
def find_mean(num_list):
    '''
    Find the average of num_list
    '''
    total = 0.0
    for x in num_list:
        total += x
    return total / len(num_list)

In [70]:
find_mean(all_budgets)

31570720.09897987

We can also use min and max:

In [71]:
print("Min budget: ", min(all_budgets))
print("Max budget: ", max(all_budgets))

Min budget:  1100
Max budget:  300000000
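The budget column happens to be complete, but the gross columns contain None wherever the data said "Unknown", and adding None to a number raises a TypeError. Below is a minimal sketch of a mean that skips missing values; the function name is made up for this example, and it uses the built-in sum function in place of the explicit loop.

```python
def find_mean_known(values):
    '''
    Average the entries of values that are not None.
    Returns None if there is nothing to average.
    '''
    known = [v for v in values if v is not None]
    if len(known) == 0:
        return None
    return sum(known) / len(known)

print(find_mean_known([10, None, 20]))
print(find_mean_known([None, None]))
```

Whether skipping missing values is the right choice depends on the question being asked; the point here is only that they have to be handled deliberately.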


## Writing to file

Finally, let's see how to output data. The first step is to open a new file to write some output to. We do it like this:

In [72]:
f = open("My_file.txt", "w") # Open a file
f.close() # And immediately close it.

Notice the 'w' after the comma: it stands for **w**rite, and tells Python that you're opening a file for writing. By default, if the file doesn't exist, Python will create it.

**warning:** If the file *does* exist, this will overwrite its entire contents. Be careful with your file naming, since you won't get a warning that you're overwriting an existing file.

You may also come across the mode 'wb', where the **b** means **binary**. In Python 3, a file opened in binary mode expects raw bytes rather than strings, so for writing text we'll stick with plain 'w'.

Now that you've opened a file, you can write to it. For practice, let's write some sample text to a new file. It's very important to close a file once you're done with it -- otherwise, the data you wrote to it may not be saved.

In [73]:
f = open("My_file.txt", "w")
f.write("Hello world.")
f.close()

Open the file up in a text editor. You'll see that it contains the "Hello world." string we put there. Let's try another one:

In [74]:
f = open("My_file.txt", "w")
f.write("Here is some text.")
f.write("And here is some more text.")
f.close()

Open the file; you'll notice that we've overwritten the previous text, and that there's no line break between the two lines we wrote to the file; we need to insert them manually, using the '\n' character, the special character for a line break. 

In [75]:
f = open("My_file.txt", "w")
f.write("Here is some text.\n")
f.write("And here is some more text.\n\nThis text should have an empty line above it.")
f.close()
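Because forgetting to close a file is an easy mistake, Python also offers the **with** statement, which closes the file automatically when the indented block ends, even if an error occurs inside it. A short sketch, writing to a made-up file name in a temporary folder so that it doesn't touch My_file.txt:

```python
import os
import tempfile

# A throwaway location for this example; the file name is made up
path = os.path.join(tempfile.gettempdir(), "with_example.txt")

with open(path, "w") as f:
    f.write("Here is some text.\n")
    f.write("And here is some more text.\n")
# No f.close() needed: leaving the with block closed the file

print(f.closed)

# Read the file back to check what was written
with open(path) as f:
    contents = f.read()
print(contents)
```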

Let's try to write some actual data. Suppose we have a few rows of coordinates, just some x and y values:

In [76]:
data = [
[1, 2],
[3, 5],
[2, 2],
[4, 3]
]

In [77]:
f = open("my_data.csv", "w")
for row in data:
    f.write(row)
    f.write("\n")
f.close()

TypeError: write() argument must be str, not list

As you can see, we need to translate the list into a string to be able to write it. Let's try that again:

In [78]:
f = open("my_data.csv", "w")
for row in data:
    f.write(str(row))
    f.write("\n")
f.close()

Open the file and you'll see that it's indeed written the lists, square brackets and all. Almost there, but we don't want the square brackets, just the data. So instead of converting the entire list to a string at once, we'll convert one record at a time. And let's add some column headings while we're at it.

In [79]:
f = open("my_data.csv", "w")
f.write("x,y\n") # The column headings.
for row in data:
    f.write(str(row[0])) # Write the first value
    f.write(",") # Add the comma to separate the values
    f.write(str(row[1])) # Write the second value
    f.write("\n") # Add the line break
f.close()

Now we've got pretty much exactly what we wanted. The downside is that it was cumbersome to code, and would get more so the more columns we add. Fortunately, Python comes with a built-in module for writing and reading csv data, called csv. Pretty easy name to remember.

First, we need to import it:

In [80]:
import csv

To output using the csv module, we first create a file, then create a CSV Writer with the file object, like this:

In [81]:
f = open("my_data.csv", "w", newline='')

my_writer = csv.writer(f) # Create the CSV writer connected to f

**my_writer** now holds a CSV writer object; think of it as a translator from Python data to how we want the data stored in the CSV file. Unlike the basic file object, the CSV writer knows how to write entire lists as a single line of CSV data. 

You'll notice that we added something to the `open` function: `newline=''`. That tells Python not to translate line endings on its own, since `csv.writer` writes its own line breaks; without it, some platforms (notably Windows) can end up with an extra blank line between rows.

In [82]:
for row in data:
    my_writer.writerow(row)
f.close()

You'll see that the file has been written exactly the way we want it to, with fewer lines of code. Just one thing is missing -- the column headers.

In [83]:
f = open("my_data.csv", "w", newline='')
my_writer = csv.writer(f)

header = ["x", "y"]
my_writer.writerow(header)
for row in data:
    my_writer.writerow(row)
f.close()

The csv module can also be used as a reader, which works similarly; here, the Reader object acts as a translator from file text to Python lists.

In [84]:
f = open("my_data.csv") # For reading, not writing this time.
my_reader = csv.reader(f)
new_rows = [row for row in my_reader]
f.close()

for row in new_rows:
    print(row)

['x', 'y']
['1', '2']
['3', '5']
['2', '2']
['4', '3']
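Two things are worth noting here. First, the reader hands back every cell as a string, so numeric columns still need converting. Second, the reader isn't limited to commas: its *delimiter* argument lets it read tab-separated text like our movie file. Here is a sketch of both, using in-memory text via io.StringIO instead of real files; the sample contents are shortened copies of data seen earlier in this lesson.

```python
import csv
import io

# Comma-separated text, in the same layout as my_data.csv
csv_text = io.StringIO("x,y\n1,2\n3,5\n", newline="")
reader = csv.reader(csv_text)
header = next(reader)  # Pull off the column headings
points = [[int(cell) for cell in row] for row in reader]
print(header, points)

# Tab-separated text, like the movie data
tsv_text = io.StringIO("03/09/12\tJohn Carter\t\t300000000\n", newline="")
movie_rows = list(csv.reader(tsv_text, delimiter="\t"))
print(movie_rows)
```

Unlike the manual split we used earlier, the reader also removes the line break at the end of each row.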
