## Input/Output

In scientific programming we usually write programs to process data. Such a program accepts some input, performs a series of computations on that input, and outputs something. One example is the program `outlier.py` (fictional, but you could write it), which takes as input a file of numbers and writes to a new file only those that are more than one standard deviation away from the mean.

It is quite clear that the program (the thing that does the computations) is distinct from the input and output. The input is what we feed to the program, and the output is what we get out. 

In an interactive environment, such as the IPython notebook, this clear distinction sometimes blurs, because we frequently provide the input data as part of the program. For instance, the input to the outlier detector might be written as follows:

```
numbers_to_check = [5, 1, 9, 2, 6, 4, 5, 6]
```

Although on a technical level the input here is a part of the program, it should still conceptually be regarded as input.
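
To make the separation concrete, here is a minimal sketch of the computation that the fictional `outlier.py` might perform on the list above. The function name `find_outliers` is hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the description does not say which variant is meant.

```python
import statistics

def find_outliers(numbers):
    """Return the values lying more than one standard deviation from the mean."""
    mean = statistics.mean(numbers)
    # Assumption: population standard deviation; statistics.stdev would
    # give the sample standard deviation instead.
    spread = statistics.pstdev(numbers)
    return [x for x in numbers if abs(x - mean) > spread]

numbers_to_check = [5, 1, 9, 2, 6, 4, 5, 6]  # the input
outliers = find_outliers(numbers_to_check)   # the output
print(outliers)
```

The list `numbers_to_check` plays the role of the input file, and `outliers` plays the role of the output file; only `find_outliers` is the program proper.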

### Trustpilot dataset

In this notebook we will work on a dataset of users from the Danish section of the review website [Trustpilot](www.trustpilot.com). For each user we registered the name, birth year, and gender, if available. The dataset consists of publicly available data and was collected a few months ago (June 2014). It is stored in a file called `trustpilot_names.tsv`, which is placed in the same directory as the notebooks.

Let's first have a look at the file. In IPython you can execute command line programs by putting them on a line which starts with an exclamation mark. We are going to use the utility program `head`, which displays the first lines in a file (10 by default).

In [4]:
!head trustpilot_names.tsv

name	birth_year	gender
Anders Harvest	1955	M
Gordon Steffensen	1955	M
tue	1979	M
Janni Michelsen	1960	F
Yasemin Yigen	1979	NA
Christian Harder	1991	M
Pia Rolschau Hansen	1978	F
Anders	1974	M
Gitte Nielsen	1962	F
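
The same peek can be done without leaving Python. The sketch below uses a hypothetical helper called `head`, built on `itertools.islice`. It is demonstrated on a small temporary file (containing a few rows copied from the output above) so that it runs on its own, but you could equally point it at `trustpilot_names.tsv`.

```python
import itertools
import os
import tempfile

def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

# Stand-in file with three lines taken from the output above
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("name\tbirth_year\tgender\n")
    f.write("Anders Harvest\t1955\tM\n")
    f.write("Janni Michelsen\t1960\tF\n")

first_two = head(demo_path, 2)
for line in first_two:
    print(line)
```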


### Reading the file

Below we are reading and processing lines from a file one at a time. The goal is to fill the three lists `names`, `ages`, and `genders` with appropriate values. The lists should be coordinated such that `names[i]` refers to the name of the *i*th person, `ages[i]` to the age of that person, and `genders[i]` to his or her gender. After reading in the file, you should check that all lists have the same length. That is,

````
len(names) == len(genders) == len(ages)
````
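
As an illustration of what "coordinated" means, here is a small sketch that fills the three lists by hand with the first three data rows from the `head` output above (ages computed as 2014 minus the birth year). It only shows the intended shape of the result, not how to read it from the file.

```python
# First three data rows from the `head` output; age = 2014 - birth_year
names = ["Anders Harvest", "Gordon Steffensen", "tue"]
ages = [2014 - 1955, 2014 - 1955, 2014 - 1979]
genders = ["M", "M", "M"]

# The consistency check described above
assert len(names) == len(genders) == len(ages)

# The same index i refers to the same person in all three lists
for i in range(len(names)):
    print(names[i], ages[i], genders[i])
```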


The line that says

````
input = open("trustpilot_names.tsv")
````

opens the file for reading and saves a reference to it in the variable `input`. 
Using the `input` variable in a `for` loop lets us examine the lines one by one. 

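*Note that `open` also accepts an `encoding` argument (such as `encoding="utf-8"`). If you leave it out, Python 3 falls back to a platform-dependent default, which is UTF-8 on most Linux and macOS systems but often something else on Windows. You can ignore this parameter as long as the names read in correctly; if non-ASCII characters such as æ, ø, and å come out garbled, specify the encoding explicitly.*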
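
The following sketch shows the open-and-loop pattern on a small stand-in file, so that it can run on its own. The file name `example.tsv` and its two rows are invented for illustration and are not part of the dataset; passing `encoding="utf-8"` and using a `with` block are optional conveniences, not something the exercise requires.

```python
import os
import tempfile

# Stand-in file with two invented rows (not from the real dataset)
path = os.path.join(tempfile.mkdtemp(), "example.tsv")
with open(path, "w", encoding="utf-8") as out:
    out.write("Søren\t1980\tM\n")
    out.write("Åse\t1975\tF\n")

lines_seen = []
# `with` closes the file automatically when the block ends
with open(path, encoding="utf-8") as infile:
    for line in infile:
        # Each line still carries its trailing newline; strip() removes it
        lines_seen.append(line.strip())

print(lines_seen)
```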

**Exercise** The implementation in the cell below is incomplete. To make it work correctly you need to do three things:

1. add code to skip the processing of the first line of the file, which is a header; and 
2. initialize the variables `name`, `birth_year`, and `gender` with correct values; and
3. remove the debugging code that exits the loop after 50 lines.

In [9]:
input = open("trustpilot_names.tsv")
names = []
ages = []
genders = []

i = 0
for line in input:
    name = "Nobody"
    birth_year = 2014
    gender = "NA"

    parts = line.strip().split("\t")
    names.append(name)
    ages.append(2014 - birth_year)
    genders.append(gender)

    if i == 49:
        break
    
    i += 1

**Exercise** Write a `for` loop that computes what percentage of people are male and female. Is the dataset reasonably balanced with respect to genders?

In [None]:
# Your code here

**Exercise** Use a list comprehension to construct a list of first names from the list of full names. We consider the first name to be whatever comes before the first whitespace in the full name. You can use the list comprehension below as a point of departure.

In [12]:
first_names = [full_name for full_name in names]
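
If list comprehensions are new to you: the general shape is `[expression for item in sequence]`, and the expression is evaluated once per item. The sketch below applies two unrelated expressions to a hypothetical list `sample_names` (three names taken from the `head` output); it illustrates the mechanics only and is not the solution to the exercise.

```python
sample_names = ["Anders Harvest", "Janni Michelsen", "tue"]

# The expression before `for` is applied to every element in turn
upper_names = [full_name.upper() for full_name in sample_names]
name_lengths = [len(full_name) for full_name in sample_names]

print(upper_names)
print(name_lengths)
```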

**Exercise** What is the most common first name? Is it male, female, or unisex? And what do you expect?

Hint: Sort the list of first names before processing it. One way to do this is by executing `first_names.sort()` in a cell.

### Further projects

Here are some other things that you might want to try with this dataset. A word of warning: some of the projects are a bit daunting to do with the tools you have learned so far. Once we begin to dive deeper into the data structures and base libraries of Python, however, you will discover that most can be solved efficiently with a few lines of Python.

* How many unique names are there?
* Make a list of names that are exclusively male and a list of names that are exclusively female.
* Some girls' names are formed by taking a boy's name and adding a female suffix. Can you find any examples of that? What would be an algorithmic way of finding such examples?
* Find the five most popular names for each decade.
* Some of the values of age are clearly not correct: we don't expect reviews from people who are four years old, nor from people aged 104. Create new versions of the `first_names`, `ages`, and `genders` lists with these values removed.
* Consider an alternative representation of the dataset. Instead of three separate lists, you would have one list `users` with each item being a dictionary with keys `first_name`, `age`, `gender`. Which operations would be easier to perform? And which would be harder?
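
For the last project, the alternative representation could look like the sketch below. The three `sample_*` lists are hypothetical stand-ins for your own `first_names`, `ages`, and `genders` (the values are derived from rows in the `head` output), and `zip` is just one possible way to build `users`.

```python
# Hypothetical stand-ins for the three coordinated lists
sample_first_names = ["Anders", "Janni", "Christian"]
sample_ages = [59, 54, 23]
sample_genders = ["M", "F", "M"]

# One dictionary per user, built by walking the three lists in parallel
users = [
    {"first_name": n, "age": a, "gender": g}
    for n, a, g in zip(sample_first_names, sample_ages, sample_genders)
]

# Everything about one person now lives in a single item
print(users[1])
print(users[1]["age"])
```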