# Introduction to Python for Biology
# Day 2

# Code Along

## Special Characters

If we want to start a new line in the middle of something we print, we use the special character `\n`.

In [55]:
print("Hi\nBye")

Hi
Bye


There are some other special characters that we use when working with strings in Python. They all start with a backslash, usually followed by a letter. For example, you may also use the `\t` (tab) character. Note that we don't need a space between the special character and the text that follows it.

In [2]:
print("This line is not tabbed")
print("\tThis line is tabbed")

This line is not tabbed
	This line is tabbed
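One way to convince yourself that each special character really is a single character is to measure it with `len()`. The short sketch below uses made-up variable names and an invented tab-separated row; the doubled backslash is the standard way to get a literal backslash.

```python
# Each special character is stored as a single character.
newline_length = len("\n")
tab_length = len("\t")
print(newline_length, tab_length)

# A tab-separated row, like a line in many data files (values invented).
row = "gene\tstart\tend"
print(row)

# To get a literal backslash, escape it with a second backslash.
windows_path = "data\\hemoglobin.txt"
print(windows_path)
```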


## String Manipulation

If we want to glue together two strings, we can **concatenate** them using the `+` (plus) symbol.

In [5]:
first_name = "Nichole"
last_name = "Bennett"
full_name = first_name + " " + last_name
print(full_name)

Nichole Bennett


We often concatenate in a print statement. (You can also separate the pieces with commas, in which case `print` puts a space between them for you.)

In [3]:
name = "Nichole"
print("Hey " + name + " welcome to class")

Hey Nichole welcome to class
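To compare the two approaches side by side, here is a small sketch. The `visits` variable is invented for illustration. The main difference is that `+` needs every piece to be a string, while commas accept other types too.

```python
name = "Nichole"
visits = 3  # a made-up number for illustration

# Commas: print() puts a space between the items and accepts non-strings.
print("Hey", name, "you have visited", visits, "times")

# Concatenation with + needs every piece to be a string,
# so the number has to be converted with str() first.
message = "Hey " + name + " you have visited " + str(visits) + " times"
print(message)
```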


Like we saw with lists yesterday, strings have a lot of built-in methods (functions that go along with them). Remember string methods show up after the name of the string variable and have parentheses after them. 


For example, there is a method `.lower()` that will change a string to all lowercase letters. This doesn't change the original variable. It returns a copy of the variable in lower case (that you can save to a new variable).

In [8]:
program = "PYTHON"
print(program.lower())
lower_program = program.lower()
print(program)
print(lower_program)

python
PYTHON
python


In [9]:
lower_program.upper()

'PYTHON'

Another useful string manipulation method is `.replace()`. It takes two arguments (both strings) and will return a copy of the original variable (so save it if you want to use it again).

In [10]:
word = "colour"
word.replace("u", "")

'color'
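A biological use of the same idea, as a rough sketch: turning a DNA sequence into the matching RNA sequence by swapping every T for a U. The sequence below is invented for illustration.

```python
dna = "ATGCTTAG"  # a short invented sequence

# .replace() returns a new string, so we save it to a new variable.
rna = dna.replace("T", "U")

print(dna)  # the original is unchanged
print(rna)
```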

Remember how we pulled out items from a list yesterday using their indices? We can do the same with a string to extract a substring. Remember that Python starts counting at 0.

In [12]:
rainbow = "ROYGBIV"
rainbow[2:6]

'YGBI'

If we don't include a second number, we'll get all of the letters up until the end of the string.

In [14]:
rainbow[3:]

'GBIV'
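A few more slicing variations may be useful. This sketch assumes the same `rainbow` string; the negative indices (which count from the end of the string) go slightly beyond what we covered above.

```python
rainbow = "ROYGBIV"

first_three = rainbow[:3]   # leave off the first number to start at position 0
last_letter = rainbow[-1]   # negative indices count back from the end
last_two = rainbow[-2:]     # the final two letters

print(first_three, last_letter, last_two)
```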

In biology, we often need to count how many times some pattern occurs in a string. `.count()` can help us count how many times a substring occurs in a string. It takes the substring as an argument and returns a number.

In [15]:
train = "chugachugachugachugachugachugachugachugachugachugachugachoochoo"

In [16]:
train.count("chuga")

11

If we want to find the location of a substring, we can use `.find()`. It takes a single string argument and returns the position where that substring first appears in the string. If the substring does not appear at all, it returns -1.

In [17]:
print(train.find("u"))

2


In [19]:
print(train.find("y"))

-1


Both `.count()` and `.find()` can only find exact matches. This doesn't work great for variable site pattern searches, but we'll learn regular expressions later (which will help us with that).
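Here is how exact matching might look on a sequence, as a minimal sketch. The DNA string is invented; `GAATTC` and `GGATCC` are the EcoRI and BamHI recognition sites.

```python
dna = "ATGCGAATTCGTAGAATTCA"  # invented sequence for illustration
site = "GAATTC"               # EcoRI recognition site

site_count = dna.count(site)     # how many exact copies of the site
first_position = dna.find(site)  # where the first copy starts (counting from 0)
print(site_count, first_position)

# .find() returns -1 when there is no exact match, so check for that.
missing = dna.find("GGATCC")     # BamHI site, absent from this sequence
if missing == -1:
    print("GGATCC not found")
```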

Another thing that we might want to do with strings is split them up into pieces. We can split a string into items in a list using `.split()` and then be able to iterate over it. `.split()` takes a single argument which is the character we want to split on (we call this the **delimiter.**)

In [22]:
words = "red,green,blue,yellow"

In [23]:
colours = words.split(",")

In [24]:
print(colours)

['red', 'green', 'blue', 'yellow']
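The pieces that come back from `.split()` are always strings, even when they look like numbers. The sketch below uses an invented comma-separated record to show pulling out fields by index and converting them with `int()` before doing arithmetic.

```python
record = "geneA,chr1,100,250"  # invented record: name, chromosome, start, end

fields = record.split(",")
gene = fields[0]

# The coordinates come back as strings, so convert them before subtracting.
start = int(fields[2])
end = int(fields[3])
length = end - start

for field in fields:
    print(field)
print(gene, "is", length, "bases long")
```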


## Reading Text from a File

As biologists, we often need to read in text from a file as part of a pipeline. Let's learn how to use Python to interact with files we have.

What kinds of text files do you use in your work? How is the data formatted? 

Before we can read a file, we have to open it. This creates a file object that we can give a variable name. 

In [121]:
my_file = open("data/hemoglobin.txt")

Once we've opened the file, we can read it and then treat the result much like a string. The file contents are different from both the file object and the name of the file. Confusing these three is a common cause of errors.

In [51]:
file_contents = my_file.read()

In [52]:
print(file_contents)

>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR


The contents we read in end with a newline character. We can strip it off using the `.rstrip()` method, which takes the character you'd like to remove from the end of the string as its argument.

In [122]:
my_file = open("data/hemoglobin.txt")
my_file_contents = my_file.read()

# remove the newline from the end of the file contents
hemo = my_file_contents.rstrip("\n")

print(hemo)

>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR


Commonly, we'll do this all in one line. You can string together Python methods. 

What is the best way to write it? The easiest way for you to understand and read it. If it makes more sense for you to write it out line by line for readability, go ahead. I often write my code line by line to start with and shorten it up in future passes. 

In [53]:
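my_file = open("data/hemoglobin.txt")  # reopen: the earlier .read() already used up the file object
hemo = my_file.read().rstrip("\n")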

## Iterating Over Lines of Text in a File

Remember loops? We can treat file objects like lists and loop over them, with every line as an individual element. This is super useful if we need to process a file line by line.

Make sure you loop over the file object, not the contents of the file (that you got from `.read()`). You'll know you've mixed these up if you get a single character on each pass through the loop instead of a whole line. It is helpful to ask yourself whether you want to read your file in as one big chunk (in which case you use `.read()`) or line by line (in which case you should loop over the file object).

In [None]:
file = open("some_input.txt")
for line in file:
    # do something with the line (pass is a placeholder so the loop runs)
    pass

Another thing to watch out for is looping over the same file object twice. You may have run into this before if you tried to rerun code from above because file objects are exhaustible. Python remembers that it is at the end of the file once you've looped over it, so it lets you know there are no more lines. You can close and reopen the file if you want to loop over it again or (better idea) you can read the contents into a list and iterate over the list multiple times without a problem. 
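To see this exhaustion for yourself, here is a small self-contained sketch. It first creates a throwaway file (the file name and its three lines are invented, and the setup uses writing, which is covered further down), then loops over the same file object twice and counts the lines seen each time.

```python
import os
import tempfile

# Set up a small throwaway file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
setup = open(path, "w")
setup.write("line one\nline two\nline three\n")
setup.close()

my_file = open(path)

first_pass = 0
for line in my_file:
    first_pass += 1

# The file object is now at the end of the file,
# so looping again finds no lines at all.
second_pass = 0
for line in my_file:
    second_pass += 1

my_file.close()
print(first_pass, second_pass)
```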

The `.readlines()` method, which is used on file objects, will read the lines of a file into a list.

First we will store a list of lines in the file.

In [123]:
my_file = open("data/hemoglobin.txt")
all_lines = my_file.readlines()

Then we can do stuff with the list by looping over it.

In [2]:
for line in all_lines:
    print("The third character is " + line[2])

The third character is p
The third character is L
The third character is V
The third character is H


In [3]:
for line in all_lines:
    print("The length is " + str(len(line)))

The length is 88
The length is 61
The length is 61
The length is 23


## Writing to a File

Let's take a moment and look at the Python documentation (either by Googling or by using cmd/ctrl + tab) to try to figure out how to use the `open()` function to write to a file. 

We see that we can use the second (optional) argument version of the `open()` function and use "w" for writing. 

This second argument can be "r" for reading (the default if we leave it off), "w" for writing, or "a" for appending. "w" will overwrite an existing file, while "a" will add new data to the end of the file without removing content. (If the file doesn't exist, both "w" and "a" create it.)

In [4]:
new_file = open("out.txt", "w")

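Now that we've opened a file for writing, we can use the `.write()` method to write some text to it. This method is a lot like print and takes a string as an argument. (The argument can also be any expression that produces a string, such as a call to a function that returns one.) Unlike print, `.write()` does not add a newline at the end for you, and it returns the number of characters it wrote, which is the number the notebook displays below.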

In [5]:
new_file.write("Transmitting Science")

20

If we check the folder we are currently working in, we can see we now have a new file with the name we gave it. Let's open it and check the file contents. 

## Closing Files

We'll also need to call the `.close()` method on the file when we are done reading from it or writing to it. (Note that `.close()` is a method and `open()` is a function.) This is a good habit to have and will prevent errors that are hard to track down.

In [6]:
my_file.close()

In [7]:
new_file.close()
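Putting opening, writing, and closing together, the sketch below shows the difference between "w" and "a" in practice. It writes to a throwaway file in a temporary folder (the path and the text are invented) so it won't touch your own files.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

# "w" creates the file, or overwrites it if it already exists.
out = open(path, "w")
out.write("first line\n")
out.close()

# Opening with "w" again throws away the earlier contents.
out = open(path, "w")
out.write("replaced\n")
out.close()

# "a" keeps what is already there and adds to the end.
out = open(path, "a")
out.write("appended\n")
out.close()

# Read the file back to check what ended up in it.
check = open(path)
contents = check.read()
check.close()
print(contents)
```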

## Pandas Library

* A data analysis library — **Pan**el **Da**ta **S**ystem.
* Created by Wes McKinney in 2009.
* Implemented in highly optimized Python/Cython.
* Like Excel or R for Python!

### Pandas is used for

* Cleaning data/munging.
* Exploratory analysis.
* Structuring data for plots or tabular display.
* Joining disparate sources.
* Modeling.
* Filtering, extracting, or transforming.

### Importing Pandas

Import Pandas at the top of your notebook. Give it the nickname **pd** so you don't have to keep typing "pandas." (But you can nickname it anything or leave out the nickname)

In [3]:
import pandas as pd

### Loading a CSV as a DataFrame

Pandas can load many types of files, but one of the most common types is .csv (comma separated values).

In [4]:
titanic = pd.read_csv('data/titanic.csv')

This creates a Pandas object called a **DataFrame.**  

DataFrames are powerful containers that have lots of built-in functions for exploring and manipulating your data. 

### Exploring the data using DataFrames

#### Use .head() to examine the top of the DataFrame

In [5]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


#### Use .tail() to examine the bottom

In [7]:
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


#### The .shape property will tell you how many rows and columns you have

In [8]:
titanic.shape

(891, 12)

#### You can look up the names of your columns using the .columns property.

In [9]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

#### You can transpose the data using .T
Note that we don't affect the original variable in this way.

In [10]:
titanic.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,881,882,883,884,885,886,887,888,889,890
PassengerId,1,2,3,4,5,6,7,8,9,10,...,882,883,884,885,886,887,888,889,890,891
Survived,0,1,1,1,0,0,0,0,1,1,...,0,0,0,0,0,0,1,0,1,0
Pclass,3,1,3,1,3,3,1,3,3,2,...,3,3,2,3,3,2,1,3,1,3
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry","Moran, Mr. James","McCarthy, Mr. Timothy J","Palsson, Master. Gosta Leonard","Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","Nasser, Mrs. Nicholas (Adele Achem)",...,"Markun, Mr. Johann","Dahlberg, Miss. Gerda Ulrika","Banfield, Mr. Frederick James","Sutehall, Mr. Henry Jr","Rice, Mrs. William (Margaret Norton)","Montvila, Rev. Juozas","Graham, Miss. Margaret Edith","Johnston, Miss. Catherine Helen ""Carrie""","Behr, Mr. Karl Howell","Dooley, Mr. Patrick"
Sex,male,female,female,female,male,male,male,male,female,female,...,male,female,male,male,female,male,female,female,male,male
Age,22,38,26,35,35,,54,2,27,14,...,33,22,28,25,39,27,19,,26,32
SibSp,1,1,0,1,0,0,0,3,0,1,...,0,0,0,0,0,0,0,1,0,0
Parch,0,0,0,0,0,0,0,1,2,0,...,0,0,0,0,5,0,0,2,0,0
Ticket,A/5 21171,PC 17599,STON/O2. 3101282,113803,373450,330877,17463,349909,347742,237736,...,349257,7552,C.A./SOTON 34068,SOTON/OQ 392076,382652,211536,112053,W./C. 6607,111369,370376
Fare,7.25,71.2833,7.925,53.1,8.05,8.4583,51.8625,21.075,11.1333,30.0708,...,7.8958,10.5167,10.5,7.05,29.125,13,30,23.45,30,7.75


#### You can access a specific column with bracket syntax (like with dictionaries) using the column's string name.

In [16]:
titanic['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

#### You can also access it using dot notation. (When might this not work?)

In [17]:
titanic.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
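One possible answer to the question above: dot notation breaks down when a column name isn't a valid Python name (for example, it contains a hyphen or a space) or when it clashes with something DataFrames already have, such as the `.count()` method. The tiny table below is invented to show both cases.

```python
import pandas as pd

# A tiny invented table with two awkward column names.
df = pd.DataFrame({"alcohol-use": [3.9, 8.5], "count": [2798, 2757]})

# Bracket syntax works for any column name.
hyphen_column = df["alcohol-use"]
count_column = df["count"]

# df.alcohol-use would be read by Python as "df.alcohol minus use" and fail.
# df.count gives back the DataFrame's .count() method, not the "count" column.
dot_count_is_method = callable(df.count)
print(dot_count_is_method)
```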

In [18]:
titanic.Name.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Notice that this looks a little different than our DataFrame above. That is because it is a Series object, which is not quite the same thing as a DataFrame.

**What's the difference between Pandas' Series and DataFrame objects?**  
Essentially, a Series object contains the data for a single column, and a DataFrame object is a matrix-like container for those Series objects that comprise your data. They mostly act like one another, but occasionally you'll run into methods that only work for one.
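You can check which kind of object you have with `type()`. In this sketch (using a small invented table), single brackets with one column name give back a Series, while passing a list of column names gives back a DataFrame, even if the list holds only one name.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bo"], "Age": [31, 45]})  # invented data

one_column = df["Name"]      # a single column name -> Series
small_frame = df[["Name"]]   # a list of column names -> DataFrame

print(type(one_column))
print(type(small_frame))
print(one_column.shape, small_frame.shape)
```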

#### Examining Your Data With .info()  
Provides information about:

* The name of the column/variable attribute.
* The type of index (RangeIndex is default).
* The count of non-null values by column/attribute.
* The type of data contained in the column/attribute.
* The unique counts of dtypes (pandas data types).
* The memory usage of our data set.

In [0]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Types affect the way data is represented in machine learning models, whether we can apply math operators to them, etc.   

Some common problems with working with a new dataset:  
* Missing values.
* Unexpected types (string/object instead of int/float).
* Dirty data (commas, dollar signs, unexpected characters, etc.).
* Blank values that are actually "non-null" or single white-space characters.
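As a rough sketch of how the first two problems can show up together: if a file uses a placeholder such as "-" for missing data, pandas reads the whole column as text. The rows below are invented, and are read from a string with `io.StringIO` instead of a real file. Telling `read_csv` about the placeholder through its `na_values` argument is one way to handle it.

```python
import io
import pandas as pd

# Invented rows that mimic a file where "-" stands for "no data".
text = "age,crack-frequency\n12,-\n13,3.0\n14,-\n"

# Read as-is: the "-" entries force the whole column to be stored as text.
as_read = pd.read_csv(io.StringIO(text))

# Tell pandas that "-" means missing: the column becomes numeric with NaN gaps.
cleaned = pd.read_csv(io.StringIO(text), na_values=["-"])

# Count the missing values in each column.
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```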

#### Summarize the data with .describe()
It gives us the following statistics:

* Count, which is the number of non-null values in the column (so it can be smaller than the number of rows).
* Mean, or, the average of the values in the column.
* Std, which is the standard deviation.
* Min, a.k.a., the minimum value.
* 25%, or, the 25th percentile of the values.
* 50%, or, the 50th percentile of the values (which is equivalent to the median).
* 75%, or, the 75th percentile of the values.
* Max, which is the maximum value.  

Let's try this on a single column as well as the entire dataframe.

In [19]:
titanic['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [20]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


There are also built-in math functions that will work on all columns of a DataFrame at once, as well as subsets of the data.

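#### For example, I can use the .mean() method on the titanic DataFrame to get the mean for every numeric column.

In [21]:
# numeric_only=True skips text columns such as Name; recent pandas versions raise an error without it
titanic.mean(numeric_only=True)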

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

### Reading in trickier file types

This worked well above because `.read_csv()` expects a comma-separated file with a header row, and that is what we gave it. What happens when the file doesn't match those expectations?

In [22]:
golf = pd.read_csv('data/playgolf.csv')

In [23]:
golf.head()

Unnamed: 0,07-01-2014|sunny|85|85|false|Don't Play
0,07-02-2014|sunny|80|90|true|Don't Play
1,07-03-2014|overcast|83|78|false|Play
2,07-04-2014|rain|70|96|false|Play
3,07-05-2014|rain|68|80|false|Play
4,07-06-2014|rain|65|70|true|Don't Play


What happened here? Let's Google `pandas .read_csv` to look at the documentation and troubleshoot. 

In [24]:
golf = pd.read_csv('data/playgolf.csv', sep = '|')

In [25]:
golf.head()

Unnamed: 0,07-01-2014,sunny,85,85.1,false,Don't Play
0,07-02-2014,sunny,80,90,True,Don't Play
1,07-03-2014,overcast,83,78,False,Play
2,07-04-2014,rain,70,96,False,Play
3,07-05-2014,rain,68,80,False,Play
4,07-06-2014,rain,65,70,True,Don't Play


We fixed part of the problem, but we still need pandas to understand we don't have a header in this file.

In [26]:
golf_cols = ["Date", "Outlook", "Temperature", "Humidity", "Windy", "Result"]
golf = pd.read_csv('data/playgolf.csv', sep = '|', header = None, names = golf_cols)

In [27]:
golf.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play


The `skiprows` and `skipfooter` arguments may also be useful if you have collaborators who make extra notes in their data files that you need to ignore.
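Here is what `skiprows` might look like in practice. This is only a sketch: the text below is invented to mimic the golf file with a collaborator's note added as the first line, and it is read from a string with `io.StringIO` instead of from a file on disk.

```python
import io
import pandas as pd

# Invented pipe-delimited text with a note on the first line.
text = (
    "collected by hand; check with the lab before using\n"
    "07-01-2014|sunny|85|85|false|Don't Play\n"
    "07-02-2014|sunny|80|90|true|Don't Play\n"
)

golf_cols = ["Date", "Outlook", "Temperature", "Humidity", "Windy", "Result"]

# skiprows=1 ignores the first line of the file before parsing starts.
golf_small = pd.read_csv(
    io.StringIO(text), sep="|", header=None, names=golf_cols, skiprows=1
)
print(golf_small)
```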

## Pandas Indexing

#### Let's read in the drug dataset for practicing indexing

In [28]:
drug = pd.read_csv("data/drug.csv")
drug.head()

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5
3,15,2956,29.2,6.0,14.5,25.0,0.5,4.0,0.1,9.5,...,0.8,3.0,2.0,4.5,1.5,6.0,0.3,10.5,0.4,30.0
4,16,3058,40.1,10.0,22.5,30.0,1.0,7.0,0.0,1.0,...,1.1,4.0,2.4,11.0,1.8,9.5,0.3,36.0,0.2,3.0


A common task is that we'll want to operate on a specific portion of our data. With indexing, we can pull out a specific part of our DataFrame.  

pandas has two main properties you can use for indexing:

* **.loc** indexes with the labels for rows and columns.
* **.iloc** indexes with the integer positions for rows and columns. 

#### Using the .loc indexer, let's pull out row 0 and all columns `dataframe.loc[rows, columns]`

In [29]:
drug.loc[0, :]

age                          12
n                          2798
alcohol-use                 3.9
alcohol-frequency             3
marijuana-use               1.1
marijuana-frequency           4
cocaine-use                 0.1
cocaine-frequency           5.0
crack-use                     0
crack-frequency               -
heroin-use                  0.1
heroin-frequency           35.5
hallucinogen-use            0.2
hallucinogen-frequency       52
inhalant-use                1.6
inhalant-frequency         19.0
pain-releiver-use             2
pain-releiver-frequency      36
oxycontin-use               0.1
oxycontin-frequency        24.5
tranquilizer-use            0.2
tranquilizer-frequency       52
stimulant-use               0.2
stimulant-frequency           2
meth-use                      0
meth-frequency                -
sedative-use                0.2
sedative-frequency           13
Name: 0, dtype: object

#### What if I want multiple rows? Let's get rows 0, 1, and 2 by passing in a list

In [30]:
drug.loc[[0,1,2], :]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5


#### Can you think of a more efficient way to do this?

In [31]:
drug.loc[0:2, :]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5


Note that .loc is inclusive on both sides. This is different than the behavior of some other Python functions, like `range`

#### Let's do the same thing for columns and just select the `sedative-use` and `sedative-frequency` columns

In [32]:
drug.loc[:, 'sedative-use':'sedative-frequency']

Unnamed: 0,sedative-use,sedative-frequency
0,0.2,13.0
1,0.1,19.0
2,0.2,16.5
3,0.4,30.0
4,0.2,3.0
5,0.5,6.5
6,0.4,10.0
7,0.3,6.0
8,0.5,4.0
9,0.3,9.0


#### We can pull out rows and columns. Let's pull out rows 0 through 2 and `sedative-use` and `sedative-frequency` columns.

In [33]:
drug.loc[0:2, 'sedative-use':'sedative-frequency']

Unnamed: 0,sedative-use,sedative-frequency
0,0.2,13.0
1,0.1,19.0
2,0.2,16.5


#### We can do the same thing with the .iloc indexer. This time we use integers for the position. Let's get all rows and the columns in positions 0 and 3.

In [34]:
drug.iloc[:,[0,3]]

Unnamed: 0,age,alcohol-frequency
0,12,3.0
1,13,6.0
2,14,5.0
3,15,6.0
4,16,10.0
5,17,13.0
6,18,24.0
7,19,36.0
8,20,48.0
9,21,52.0


#### Let's get all of the rows and the first four columns (positions 0 through 3) using `.iloc`

In [35]:
drug.iloc[:, 0:4]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency
0,12,2798,3.9,3.0
1,13,2757,8.5,6.0
2,14,2792,18.1,5.0
3,15,2956,29.2,6.0
4,16,3058,40.1,10.0
5,17,3038,49.3,13.0
6,18,2469,58.7,24.0
7,19,2223,64.6,36.0
8,20,2271,69.7,48.0
9,21,2354,83.2,52.0


Note that `.iloc` is inclusive of the first number but exclusive of the second number. This is more like `range`.
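The two indexers can be compared directly on a small table. In this sketch (the first four rows of `age` and `n` typed in by hand), the same slice `0:2` gives three rows with `.loc` and two with `.iloc`. Sorting the table also shows that `.loc` follows the row labels while `.iloc` follows the current order.

```python
import pandas as pd

df = pd.DataFrame({"age": [12, 13, 14, 15], "n": [2798, 2757, 2792, 2956]})

by_label = df.loc[0:2, :]      # labels 0, 1, and 2 (inclusive on both sides)
by_position = df.iloc[0:2, :]  # positions 0 and 1 (the end is excluded)
print(len(by_label), len(by_position))

# After sorting, the row labels travel with their rows.
shuffled = df.sort_values("n")
age_in_first_position = shuffled.iloc[0, 0]  # whichever row is now on top
age_with_label_zero = shuffled.loc[0, "age"] # the row that was labelled 0
print(age_in_first_position, age_with_label_zero)
```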

#### Let's get the first four rows and the first two columns

In [36]:
drug.iloc[0:4, 0:2]

Unnamed: 0,age,n
0,12,2798
1,13,2757
2,14,2792
3,15,2956


### Creating DataFrames

You can create your own DataFrame without importing data from a file using `pd.DataFrame()` on a dictionary.  
Make sure the dictionary has lists of values that are all the same length. The keys correspond to the names of the columns, and the values correspond to the data in the columns.

In [37]:
mydata = pd.DataFrame({'Letters':['A','B','C'], 'Integers':[1,2,3], 'Floats':[2.2, 3.3, 4.4]})
mydata

Unnamed: 0,Letters,Integers,Floats
0,A,1,2.2
1,B,2,3.3
2,C,3,4.4


#### Examine the data types

Use .dtypes on your DataFrame.  

In [38]:
mydata.dtypes

Letters      object
Integers      int64
Floats      float64
dtype: object

Strings are stored as a type called "object," as they are not guaranteed to take up a set amount of space (strings can be any length).

#### Rename columns

Change the column name Integers to Ints:

In [39]:
mydata.rename(columns={'Integers':'Ints'},inplace=True)
mydata

Unnamed: 0,Letters,Ints,Floats
0,A,1,2.2
1,B,2,3.3
2,C,3,4.4


Why did we have to use `inplace` this time? Let's check the documentation. See that `inplace=False` is the default for this method, which means that by default it returns a changed copy and leaves the original alone. It's pandas' way of trying to protect us.

#### Rename all of the columns by assigning a list to the .columns property

In [40]:
mydata.columns=['A','B','C']
mydata

Unnamed: 0,A,B,C
0,A,1,2.2
1,B,2,3.3
2,C,3,4.4


## Changing data types

Load the `drinks.csv` data.

In [43]:
drinks = pd.read_csv("data/drinks.csv")
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


#### Check the datatypes of the dataframe

In [44]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

#### Change the datatype of the `beer_servings` column to floating point

In [45]:
drinks.beer_servings = drinks.beer_servings.astype(float)

In [46]:
drinks.dtypes

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

## Filtering and Sorting DataFrames

#### Filter drinks to include only European countries.

First we create a series of Booleans

In [47]:
drinks.continent=='EU'

0      False
1       True
2      False
3       True
4      False
       ...  
188    False
189    False
190    False
191    False
192    False
Name: continent, Length: 193, dtype: bool

Then we can use this series to filter our dataframe. (This is why we see the `drinks` twice.)

In [48]:
drinks[drinks.continent=='EU']

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89.0,132,54,4.9,EU
3,Andorra,245.0,138,312,12.4,EU
7,Armenia,21.0,179,11,3.8,EU
9,Austria,279.0,75,191,9.7,EU
10,Azerbaijan,21.0,46,5,1.3,EU
15,Belarus,142.0,373,42,14.4,EU
16,Belgium,295.0,84,212,10.5,EU
21,Bosnia-Herzegovina,76.0,173,8,4.6,EU
25,Bulgaria,231.0,252,94,10.3,EU
42,Croatia,230.0,87,254,10.2,EU


#### Filter drinks to include only European countries with wine_servings > 300.

In [49]:
drinks[(drinks.continent=='EU') & (drinks.wine_servings > 300)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245.0,138,312,12.4,EU
61,France,127.0,151,370,11.8,EU
136,Portugal,194.0,67,339,11.0,EU


#### Filter drinks to include only countries with wine_servings > 300 or beer_servings > 300.

In [50]:
drinks[(drinks.beer_servings > 300) | (drinks.wine_servings > 300)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245.0,138,312,12.4,EU
45,Czech Republic,361.0,170,134,11.8,EU
61,France,127.0,151,370,11.8,EU
62,Gabon,347.0,98,59,8.9,AF
65,Germany,346.0,117,175,11.3,EU
81,Ireland,313.0,118,165,11.4,EU
98,Lithuania,343.0,244,56,12.9,EU
117,Namibia,376.0,3,1,6.8,AF
129,Palau,306.0,63,23,6.9,OC
135,Poland,343.0,215,56,10.9,EU


#### If we find ourselves gluing together a bunch of "OR" statements, we can use `.isin` to create a boolean series to pass into the dataframe

In [51]:
drinks[(drinks.continent=='EU') | (drinks.continent=='AF') | (drinks.continent=='OC')]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89.0,132,54,4.9,EU
2,Algeria,25.0,0,14,0.7,AF
3,Andorra,245.0,138,312,12.4,EU
4,Angola,217.0,57,45,5.9,AF
7,Armenia,21.0,179,11,3.8,EU
...,...,...,...,...,...,...
182,United Kingdom,219.0,126,195,10.4,EU
183,Tanzania,36.0,6,1,5.7,AF
187,Vanuatu,21.0,18,11,0.9,OC
191,Zambia,32.0,19,4,2.5,AF


In [52]:
drinks.continent.isin(['EU', 'AF', 'OC'])

0      False
1       True
2       True
3       True
4       True
       ...  
188    False
189    False
190    False
191     True
192     True
Name: continent, Length: 193, dtype: bool

In [53]:
drinks[drinks.continent.isin(['EU', 'AF', 'OC'])]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89.0,132,54,4.9,EU
2,Algeria,25.0,0,14,0.7,AF
3,Andorra,245.0,138,312,12.4,EU
4,Angola,217.0,57,45,5.9,AF
7,Armenia,21.0,179,11,3.8,EU
...,...,...,...,...,...,...
182,United Kingdom,219.0,126,195,10.4,EU
183,Tanzania,36.0,6,1,5.7,AF
187,Vanuatu,21.0,18,11,0.9,OC
191,Zambia,32.0,19,4,2.5,AF


#### Calculate the mean beer_servings for all of Europe.

In [54]:
drinks[drinks.continent=='EU'].beer_servings.mean()

193.77777777777777

#### Determine which 10 countries have the highest total_litres_of_pure_alcohol.

In [55]:
drinks.sort_values('total_litres_of_pure_alcohol').tail(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
99,Luxembourg,236.0,133,271,11.4,EU
155,Slovakia,196.0,293,116,11.4,EU
81,Ireland,313.0,118,165,11.4,EU
141,Russian Federation,247.0,326,73,11.5,AS
61,France,127.0,151,370,11.8,EU
45,Czech Republic,361.0,170,134,11.8,EU
68,Grenada,199.0,438,28,11.9,
3,Andorra,245.0,138,312,12.4,EU
98,Lithuania,343.0,244,56,12.9,EU
15,Belarus,142.0,373,42,14.4,EU


#### Which 10 countries have the lowest total_litres_of_pure_alcohol?

In [56]:
drinks.sort_values('total_litres_of_pure_alcohol', ascending=False).tail(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
103,Maldives,0.0,0,0,0.0,AS
106,Marshall Islands,0.0,0,0,0.0,OC
46,North Korea,0.0,0,0,0.0,AS
158,Somalia,0.0,0,0,0.0,AF
147,San Marino,0.0,0,0,0.0,EU
79,Iran,0.0,0,0,0.0,AS
90,Kuwait,0.0,0,0,0.0,AS
128,Pakistan,0.0,0,0,0.0,AS
97,Libya,0.0,0,0,0.0,AF
0,Afghanistan,0.0,0,0,0.0,AS


Side note: This does not change the underlying data. How can we change the underlying data?
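One possible answer, sketched on a small made-up DataFrame (the `toy` name and its values are assumptions for illustration): either assign the sorted result back to the variable, or pass `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'liters': [2.0, 0.5, 1.0]})

# sort_values() returns a new, sorted DataFrame; toy itself is untouched
sorted_copy = toy.sort_values('liters')
print(toy.liters.tolist())          # still in the original order

# Option 1: assign the result back to the same name
toy = toy.sort_values('liters')

# Option 2: sort in place (the method then returns None)
toy.sort_values('liters', ascending=False, inplace=True)
print(toy.liters.tolist())
```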

#### Let's sort by multiple columns. First sort by `beer_servings` then by `wine_servings`.

In [57]:
drinks.sort_values(['beer_servings', 'wine_servings'])

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0.0,0,0,0.0,AS
13,Bangladesh,0.0,0,0,0.0,AS
46,North Korea,0.0,0,0,0.0,AS
79,Iran,0.0,0,0,0.0,AS
90,Kuwait,0.0,0,0,0.0,AS
...,...,...,...,...,...,...
135,Poland,343.0,215,56,10.9,EU
65,Germany,346.0,117,175,11.3,EU
62,Gabon,347.0,98,59,8.9,AF
45,Czech Republic,361.0,170,134,11.8,EU


## Renaming, Adding, and Removing Columns

#### Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a new DataFrame.

In [58]:
renamed_drinks = drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

#### Perform the same renaming for `drinks`, but in place.

In [59]:
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)

In [60]:
drinks.head()

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0.0,0,0,0.0,AS
1,Albania,89.0,132,54,4.9,EU
2,Algeria,25.0,0,14,0.7,AF
3,Andorra,245.0,138,312,12.4,EU
4,Angola,217.0,57,45,5.9,AF


#### Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [61]:
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
drinks.columns = drink_cols

#### Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']` when you import the file.

In [62]:
# header = 0 means the 0th row has existing column names I am replacing
drinks = pd.read_csv('data/drinks.csv', header=0, names=drink_cols)

#### Bonus Tip: What if we have a lot of columns where we want to replace spaces with underscores?

In [63]:
drinks.columns = drinks.columns.str.replace(' ', '_')

#### Make a `servings` column that combines `beer`, `spirit`, and `wine`.

In [64]:
drinks['servings'] = drinks.beer + drinks.spirit + drinks.wine

#### Make an `mL` column that is the `liters` column multiplied by 1,000.

In [65]:
drinks['mL'] = drinks.liters * 1000

#### Remove the `mL` column, returning a new DataFrame.

In [66]:
dropped = drinks.drop('mL', axis=1) # axis=0 for rows, 1 for columns

#### Remove the `mL` and `servings` columns from drinks in place.

In [67]:
drinks.drop(['mL', 'servings'], axis=1, inplace=True)   # Drop multiple columns.

#### What if we want to remove rows instead of columns?

In [68]:
drinks.drop([0,1], axis = 0)
# axis=0 is actually the default, so we wouldn't need to specify it, but it's a good idea to be explicit

Unnamed: 0,country,beer,spirit,wine,liters,continent
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
5,Antigua & Barbuda,102,128,45,4.9,
6,Argentina,193,25,221,8.3,SA
...,...,...,...,...,...,...
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF


## Axis parameter

#### `axis=0` moves down the rows and collapses the values into one mean per column

In [69]:
drinks.mean(axis=0, numeric_only=True)   # numeric_only=True skips text columns such as country

beer      106.160622
spirit     80.994819
wine       49.450777
liters      4.717098
dtype: float64

#### `axis=1` moves across the columns and collapses the values into one mean per row (It helps me to think of the number 1 looking like an architectural column)

In [70]:
drinks.mean(axis=1, numeric_only=True)

0        0.000
1       69.975
2        9.925
3      176.850
4       81.225
        ...   
188    110.925
189     29.000
190      1.525
191     14.375
192     22.675
Length: 193, dtype: float64

#### `axis` has aliases/nicknames that are a bit more intuitive

In [71]:
drinks.mean(axis='index', numeric_only=True)

beer      106.160622
spirit     80.994819
wine       49.450777
liters      4.717098
dtype: float64

In [72]:
drinks.mean(axis='columns', numeric_only=True)

0        0.000
1       69.975
2        9.925
3      176.850
4       81.225
        ...   
188    110.925
189     29.000
190      1.525
191     14.375
192     22.675
Length: 193, dtype: float64
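With only two rows and two columns, the direction of each `axis` is easy to verify by hand. This is a minimal sketch; the `toy` DataFrame and its numbers are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'beer': [10, 20], 'wine': [30, 50]})

# axis=0 ('index'): collapse down the rows, leaving one value per column
col_means = toy.mean(axis=0)

# axis=1 ('columns'): collapse across the columns, leaving one value per row
row_means = toy.mean(axis=1)

print(col_means.to_dict())
print(row_means.tolist())
```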

## Handling Missing Values

#### Create a dataframe of Booleans indicating which values are missing or not missing.

In [73]:
drinks.isnull()

Unnamed: 0,country,beer,spirit,wine,liters,continent
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
188,False,False,False,False,False,False
189,False,False,False,False,False,False
190,False,False,False,False,False,False
191,False,False,False,False,False,False


In [74]:
drinks.notnull()

Unnamed: 0,country,beer,spirit,wine,liters,continent
0,True,True,True,True,True,True
1,True,True,True,True,True,True
2,True,True,True,True,True,True
3,True,True,True,True,True,True
4,True,True,True,True,True,True
...,...,...,...,...,...,...
188,True,True,True,True,True,True
189,True,True,True,True,True,True
190,True,True,True,True,True,True
191,True,True,True,True,True,True


#### Find the number of missing values by column in `drinks`.

In [75]:
drinks.isnull().sum()       # Count the missing values in each column

country       0
beer          0
spirit        0
wine          0
liters        0
continent    23
dtype: int64

#### Drop rows where ANY values are missing in `drinks` (returning a new DataFrame).

In [76]:
print(drinks.shape)
d = drinks.dropna(how='any') # how='any' is the default, but we are being explicit
print(d.shape)

(193, 6)
(170, 6)


#### Drop rows only where ALL values are missing in `drinks`.

In [77]:
print(drinks.shape)
d = drinks.dropna(how='all')
print(d.shape)

(193, 6)
(193, 6)


#### Filling in NaN Values. What's up with all of these NaN continents?

In [78]:
drinks[drinks['continent'].isnull()].head(7)

Unnamed: 0,country,beer,spirit,wine,liters,continent
5,Antigua & Barbuda,102,128,45,4.9,
11,Bahamas,122,176,51,6.3,
14,Barbados,143,173,36,6.3,
17,Belize,263,114,8,6.8,
32,Canada,240,122,100,8.2,
41,Costa Rica,149,87,11,4.4,
43,Cuba,93,137,5,4.2,


All of these countries are in North America (NA). When the file was read in, the string 'NA' was misinterpreted as a null (NaN) value.

#### Fill in the missing values of the `continent` column using string 'NA'.

In [80]:
drinks['continent'] = drinks['continent'].fillna('NA')   # assign back; inplace=True on a selected column is unreliable in recent pandas versions
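Another option is to stop pandas from treating 'NA' as missing when the file is read. The sketch below uses a two-line, made-up CSV held in a string (not the real `drinks.csv`) and the `keep_default_na` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

# Default behaviour: the string 'NA' is treated as a missing value
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False switches off the built-in list of missing-value markers
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default.continent.isnull().sum())
print(kept.continent.tolist())
```

Keep in mind that this setting applies to the whole file, so it is probably only a good fit when the file has no genuinely missing values you want pandas to detect.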

## Split-Apply-Combine

#### Find the mean beer servings across the entire `drinks` dataset

In [81]:
drinks.beer.mean()

106.16062176165804

#### But what if we wanted to look at beer servings by continent? This is where `.groupby()` is useful. It splits the data by continent and then calculates the mean for each group.

In [82]:
drinks.groupby('continent').beer.mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
NA    145.434783
OC     89.687500
SA    175.083333
Name: beer, dtype: float64

Use a `.groupby()` whenever you want to analyze a dataset by some category. If you can phrase your question as "For each...", then it is a good candidate for a `.groupby()`. For example, "For each continent, what is the mean beer serving?"
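To make the "split, apply, combine" idea concrete, here is a minimal sketch that does the three steps by hand and then with `.groupby()`. The `toy` DataFrame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer': [100, 200, 10, 30],
})

# By hand: split into one filtered table per continent, apply mean(), combine in a dict
by_hand = {}
for continent in toy.continent.unique():
    by_hand[continent] = toy[toy.continent == continent].beer.mean()

# The same three steps in one line
grouped = toy.groupby('continent').beer.mean()

print(by_hand)
print(grouped.to_dict())
```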

#### What happens if we don't specify a column? Let's find the max of all the columns

In [83]:
drinks.groupby('continent').max()

Unnamed: 0_level_0,country,beer,spirit,wine,liters
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AF,Zimbabwe,376,152,233,9.1
AS,Yemen,247,326,123,11.5
EU,United Kingdom,361,373,370,14.4
NA,USA,285,438,100,11.9
OC,Vanuatu,306,254,212,10.4
SA,Venezuela,333,302,221,8.3


#### Using the `.agg` function we can specify multiple functions at once for our `.groupby()`

In [84]:
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max'])

Unnamed: 0_level_0,count,mean,min,max
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AF,53,61.471698,0,376
AS,44,37.045455,0,247
EU,45,193.777778,0,361
NA,23,145.434783,1,285
OC,16,89.6875,0,306
SA,12,175.083333,93,333


## String methods

#### You can use Python's string methods with pandas by using `.str` before the name of the string method. Remember that some of these string methods (such as `.str.contains()`) treat their pattern as a regular expression by default.

In [85]:
drinks.country.str.upper()

0      AFGHANISTAN
1          ALBANIA
2          ALGERIA
3          ANDORRA
4           ANGOLA
          ...     
188      VENEZUELA
189        VIETNAM
190          YEMEN
191         ZAMBIA
192       ZIMBABWE
Name: country, Length: 193, dtype: object

In [86]:
drinks[drinks.country.str.contains('United')]

Unnamed: 0,country,beer,spirit,wine,liters,continent
181,United Arab Emirates,16,135,5,2.8,AS
182,United Kingdom,219,126,195,10.4,EU
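Because `.str.contains()` treats its pattern as a regular expression by default, special characters matter. The following is a small sketch on a made-up Series (the `names` values are assumptions, not rows from `drinks`).

```python
import pandas as pd

names = pd.Series(['United Kingdom', 'Tanzania (United Rep.)', 'Chile'])

# Plain substring match
has_united = names.str.contains('United')

# '^' is regular-expression syntax for "starts with"
starts_united = names.str.contains('^United')

# regex=False treats every character literally, so '(' needs no escaping
has_paren = names.str.contains('(', regex=False)

print(has_united.tolist())
print(starts_united.tolist())
print(has_paren.tolist())
```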


## File Contents and Manipulation

The code in this section interacts with the operating system, so it may require some tweaking to work on your machine. Also, keep in mind that the file paths need to refer to the path on your computer.

These file paths are written in Unix/Linux style and will need adjustment if you are working from a Windows machine. 

We can use the `os` module to interact with the files on our computer. 

In [87]:
import os

The `os.rename()` function allows us to rename files. 

In [90]:
os.rename("out.txt", "new.txt")

We can also move our file while we rename it. If you don't change the name while doing this, it will just move the file to a new folder.

In [91]:
os.rename("new.txt", "data/new.txt")

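We can also create a directory with the `os.mkdir()` function (it raises an error if the directory already exists). The `os.makedirs()` function also creates any missing parent directories along the way, so we can make a nested set of folders at once. 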

In [None]:
os.mkdir("new_folder")
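Here is a minimal sketch of the difference between the two functions. It works inside a temporary directory so nothing on your machine is touched, and the folder names (`results`, `run_1`, `plots`) are made up for illustration.

```python
import os
import tempfile

# Throwaway working directory
base = tempfile.mkdtemp()

# os.mkdir() creates one directory; its parent must already exist
single = os.path.join(base, "new_folder")
os.mkdir(single)

# os.makedirs() also creates any missing parent directories along the way
nested = os.path.join(base, "results", "run_1", "plots")
os.makedirs(nested, exist_ok=True)   # exist_ok=True: no error if it already exists
os.makedirs(nested, exist_ok=True)   # safe to call again

print(os.path.isdir(single), os.path.isdir(nested))
```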

We can also check to see if a file or directory exists using `os.path.exists()`. This will return a Boolean (True or False).

In [98]:
if os.path.exists("data/titanic.csv"):
    print("The titanic data file exists")

The titanic data file exists


We can use the `shutil` module to make copies of our files.

In [94]:
import shutil

In [95]:
shutil.copy("data/new.txt", "data/copy.txt")

'data/copy.txt'

We can make copies of entire directories with the `shutil.copytree()` function. 

In [96]:
shutil.copytree("data", "data_copy")

'data_copy'

Using the `os` module again, we can remove files and empty folders. For non-empty folders, we go back to `shutil`.

To delete a single file, we use `os.remove()`.

In [100]:
os.remove("data/new.txt")

We can delete empty folders using `os.rmdir()`

In [102]:
os.rmdir("new_folder")

To delete a folder and all the files in it, we use `shutil.rmtree()`

In [103]:
shutil.rmtree("data_copy")
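The steps in this section can be strung together into one small script. This sketch runs entirely inside a temporary directory, so the file names only mirror the ones used above and no real files are affected.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                       # throwaway working directory
src = os.path.join(base, "out.txt")
with open(src, "w") as handle:
    handle.write("ATGC\n")

os.mkdir(os.path.join(base, "data"))
moved = os.path.join(base, "data", "new.txt")
os.rename(src, moved)                           # move and rename in one step

shutil.copy(moved, os.path.join(base, "data", "copy.txt"))
os.remove(moved)                                # delete a single file

remaining = sorted(os.listdir(os.path.join(base, "data")))
print(remaining)

shutil.rmtree(base)                             # delete the folder and everything in it
print(os.path.exists(base))
```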

If we want to list our files and folders, we can use `os.listdir()` and pass the path of the folder we want to look into as a string. If you pass "." as the path, it will list the contents of the current working directory. 

In [107]:
for file_name in os.listdir("data"):
    print(file_name)

pluton.csv
seeds.csv
drinks.csv
diamonds.csv
balance.csv
chem.csv
cancer.csv
playgolf.csv
car.csv
ufo.csv
fruit.txt
titanic_test.csv
pokemon.csv
iris.csv
titanic.csv
wine.csv
drug.csv
copy.txt
diabetes.csv
lang.csv
gender_submission.csv


In [108]:
for file_name in os.listdir("."):
    print(file_name)

.DS_Store
ML
day_2.ipynb
hemoglobin.txt
.ipynb_checkpoints
data
day_1.ipynb
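Often we only want some of the files in a folder. The sketch below builds a temporary folder with a few made-up file names and keeps only the ones ending in ".csv".

```python
import os
import tempfile

# Build a throwaway folder with a few made-up file names
folder = tempfile.mkdtemp()
for name in ["iris.csv", "fruit.txt", "wine.csv", ".DS_Store"]:
    open(os.path.join(folder, name), "w").close()

csv_files = []
for file_name in os.listdir(folder):
    # str.endswith() keeps only names with the extension we care about
    if file_name.endswith(".csv"):
        csv_files.append(file_name)

csv_files.sort()   # os.listdir() returns names in arbitrary order
print(csv_files)
```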


## Running External Programs from Python

If we want to run external programs through Python, we use the `subprocess` module. (Note that we are getting into tricky territory here.) 

To run an external program, we use the `subprocess.run()` function. In its simplest form, this function takes a single string argument containing the path to the executable (program) we want to run. 

Let's try it out with a bash shell command `ls` (which will list our directories and files). We add the second argument `shell=True` to tell it to run the string as a shell command. 

In [109]:
import subprocess

In [113]:
subprocess.run("ls", shell = True)

CompletedProcess(args='ls', returncode=0)

The output from the program will get printed to our screen. (Note that we could also have passed a list containing the program name followed by its arguments.) The `CompletedProcess` object that comes back has a `returncode` of 0, which tells us the command ran successfully.

Let's run an external program (`ls` again) and store the output in a variable so we can do something useful with it later. For this, we'll use `subprocess.check_output()` which takes the same arguments as we saw with `subprocess.run()`. 

In [114]:
ls = subprocess.check_output("ls", shell=True)

In [115]:
print(ls)

b'ML\ndata\nday_1.ipynb\nday_2.ipynb\nhemoglobin.txt\n'


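Notice that we got multiple lines separated by newline characters (`\n`). The leading `b` means the output is a bytes object rather than a regular string; calling `.decode()` on it converts it to a string. 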
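To do something useful with captured output, we usually turn it into a list of lines. This is a minimal sketch: instead of `ls`, it runs the Python interpreter itself (via `sys.executable`) so that it behaves the same on any operating system, and the printed words are just placeholders.

```python
import subprocess
import sys

# sys.executable is the path to the Python interpreter running this code,
# used here only so the example works on any operating system
command = [sys.executable, "-c", "print('ML'); print('data')"]

raw = subprocess.check_output(command)      # bytes, like the b'...' output of ls
text = raw.decode()                         # bytes -> str
lines = text.splitlines()                   # one list item per line

print(type(raw).__name__)
print(lines)
```

Passing the command as a list (program first, then its arguments) avoids the need for `shell=True`.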

## Taking User Input

To interact with the program user and get their input, we can use the `input()` function. This function takes a string as its argument. 

Whatever the user types is returned as a string, with the trailing newline already removed. So if you want a number, you have to convert it with `int()` or `float()`. And remember that `.strip()` and `.rstrip()` remove any stray spaces the user typed around their answer.

In [119]:
name = input("What is your name?")
print("Hey " + name + "!")

What is your name?Nichole
Hey Nichole!


When we take input from the user, we open ourselves up to new and fun errors if they don't know they are supposed to input something or input it in the wrong format. So, it's a good idea to do input validation to make sure their input makes sense. 

This sort of "defensive programming" and testing is really important to creating quality code, and we'll come back to it! We'll learn exceptions later, and that will be a better way to check our user input.
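Until we get to exceptions, a simple check with string methods already catches many bad inputs. The sketch below uses a hypothetical helper, `parse_count`, which is not part of Python; it is only one way to validate a whole number.

```python
def parse_count(raw_text):
    """Return the whole number in raw_text, or None if it isn't one."""
    cleaned = raw_text.strip()        # drop stray spaces around the answer
    if cleaned.isdigit():             # True only if every character is a digit
        return int(cleaned)
    return None

# In a real program the string would come from input(), for example:
#     answer = input("How many sequences? ")
for answer in ["12", " 7 ", "twelve", ""]:
    print(repr(answer), "->", parse_count(answer))
```

Note that this check also rejects negative numbers and decimals, because `-` and `.` are not digits.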

We may also want to get input from the command line. This course won't go into the command line much, but here is how we can interact with command line arguments using the `sys` module. 

Let's say we were running a command line program called `my_program` and gave it a few arguments:

`my_program one two three` 

We could get those arguments using `sys.argv`

`import sys`

`print(sys.argv)`

This would print:

`['my_program', 'one', 'two', 'three']`

(`sys.argv` is a list of strings: the first item is the name of the program, followed by the arguments.)
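Since `sys.argv` is an ordinary list, we can index and slice it. Here is a small sketch with a hypothetical helper, `describe_args`, that is fed a hand-written list so it can be run from a notebook.

```python
import sys

def describe_args(argv):
    """Split an argv-style list into the program name and its arguments."""
    program_name = argv[0]
    arguments = argv[1:]
    return program_name, arguments

# What the script would see for: my_program one two three
name, args = describe_args(["my_program", "one", "two", "three"])
print(name)
print(args)

# In a real script you would pass sys.argv itself:
#     name, args = describe_args(sys.argv)
```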

# Independent Practice

### Creating a FASTA file
FASTA is a file format that is used to store DNA and protein sequence data. Each sequence has a header line that starts with a greater-than symbol (`>`) followed by the accession name, and the sequence itself goes on the line(s) below it. There may be multiple sequences in one file.

\>sequence_one
<br>
GTTTCAAAGAT
<br>
\>sequence_two
<br>
ATCAGATCGGA
<br>
\>sequence_three
<br>
ACTGCATCGTACT
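To get a feel for the format, here is a minimal sketch that goes the other way and reads FASTA text into a dictionary. It assumes a simple file where every header starts with `>`; the variable names are made up for illustration.

```python
fasta_text = """>sequence_one
GTTTCAAAGAT
>sequence_two
ATCAGATCGGA
"""

records = {}
current_name = None
for line in fasta_text.splitlines():
    if line.startswith(">"):
        current_name = line[1:]          # header line: drop the '>' symbol
        records[current_name] = ""
    elif current_name is not None:
        records[current_name] += line    # sequence line: append to the current record

print(records)
```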


Write a Python program that will make FASTA files for the following sequences. Make sure all are in uppercase letters.

SEQ1: atcggccatctagccgg
<br>
SEQ2: ACTGTACATGTGCGCTAG
<br>
SEQ3: ccatctagcTGTAC

### Creating Multiple FASTA files
Use the sequences from the previous exercise, but this time create three new files in the FASTA format (one sequence per file). The names of the files will be the same as the sequence names and end in ".fasta".

### Pandas Practice: Importing and Inspecting Data

Load the `drinks.csv` data.  

Perform the following:  

1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the beer_servings column/Series to a variable.
4. Calculate summary statistics for beer_servings.
5. Calculate the mean of beer_servings.
6. Count the values of unique categories in continent. (.value_counts)
7. Print the dimensions of the drinks DataFrame.
8. Find the first three items of the value counts of the continent column.

In [107]:
drinks = pd.read_csv("data/drinks.csv")

In [108]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [109]:
drinks.tail()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF
192,Zimbabwe,64,18,4,4.7,AF


In [110]:
drinks.index

RangeIndex(start=0, stop=193, step=1)

In [111]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [112]:
drinks.shape

(193, 6)

In [113]:
beer_servings = drinks['beer_servings']

In [114]:
drinks.describe()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


In [115]:
drinks.beer_servings.describe()

count    193.000000
mean     106.160622
std      101.143103
min        0.000000
25%       20.000000
50%       76.000000
75%      188.000000
max      376.000000
Name: beer_servings, dtype: float64

In [116]:
drinks.beer_servings.mean()

106.16062176165804

In [117]:
drinks.continent.value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64

### Pandas Practice: Filtering 

#### Using the UFO data ("ufo.csv")

1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state VA.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where city is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of city and state.


**Bonus:** Drop rows where City or Shape Reported is missing.

In [160]:
# read in the csv as a dataframe
ufo = pd.read_table("data/ufo.csv", sep=',')

In [161]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [162]:
# Check the shape of the DataFrame.
ufo.shape

(80543, 5)

In [163]:
# Calculate the most frequent value for each of the columns in a single command.
ufo.describe()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
count,80496,17034,72141,80543,80543
unique,13504,31,27,52,68901
top,Seattle,ORANGE,LIGHT,CA,7/4/2014 22:00
freq,646,5216,16332,10743,45


In [164]:
# What are the four most frequently reported colors?
ufo['Colors Reported'].value_counts().head(4)

ORANGE    5216
RED       4809
GREEN     1897
BLUE      1855
Name: Colors Reported, dtype: int64

In [165]:
# For reports in `VA`, what's the most frequently listed city?
ufo[ufo.State=='VA'].City.value_counts().head(1)

Virginia Beach    110
Name: City, dtype: int64

In [166]:
# Show only the UFO reports from Arlington, VA.
ufo[(ufo.City=='Arlington') & (ufo.State=='VA')]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
6300,Arlington,,CHEVRON,VA,5/5/1990 21:40
10278,Arlington,,DISK,VA,5/27/1997 15:30
14527,Arlington,,OTHER,VA,9/10/1999 21:41
17984,Arlington,RED,DISK,VA,11/19/2000 22:00
21201,Arlington,GREEN,FIREBALL,VA,1/7/2002 17:45
22633,Arlington,,LIGHT,VA,7/26/2002 1:15
22780,Arlington,,LIGHT,VA,8/7/2002 21:00
25066,Arlington,,CIGAR,VA,6/1/2003 22:34
27398,Arlington,,VARIOUS,VA,12/13/2003 2:00


In [167]:
# Count the number of missing values in each column.
ufo.isnull().sum()

City                  47
Colors Reported    63509
Shape Reported      8402
State                  0
Time                   0
dtype: int64

In [168]:
# Show only the UFO reports in which the `city` is missing.
ufo[ufo.City.isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


In [169]:
# How many rows remain if you drop all rows with any missing values?
ufo.dropna().shape[0]

15510

In [170]:
# Replace any spaces in the column names with underscores.
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

In [171]:
# Create a new column called `Location` that includes both `City` and `State`.
# For example, the `Location` for the first row would be `Ithaca, NY`.
ufo['Location'] = ufo.City + ', ' + ufo.State

In [174]:
# Bonus: how='all' drops rows only where BOTH City and Shape_Reported are missing;
# use how='any' to drop rows where either one is missing
subset = ufo.dropna(subset=['City', 'Shape_Reported'], how='all')
subset.shape

(80539, 6)
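The difference between `how='all'` and `how='any'` is easiest to see on a tiny example. This sketch uses a three-row, made-up table (the `toy` DataFrame is not the UFO data).

```python
import pandas as pd

toy = pd.DataFrame({
    'City': ['Ithaca', None, None],
    'Shape_Reported': ['TRIANGLE', 'DISK', None],
    'State': ['NY', 'CA', 'LA'],
})

# how='all': drop a row only if BOTH listed columns are missing
both_missing_dropped = toy.dropna(subset=['City', 'Shape_Reported'], how='all')

# how='any': drop a row if EITHER listed column is missing
either_missing_dropped = toy.dropna(subset=['City', 'Shape_Reported'], how='any')

print(both_missing_dropped.State.tolist())
print(either_missing_dropped.State.tolist())
```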

### Fire Ant DNA Sequences
In the data folder there is a file called "solenopsis_invicta.txt" that contains genomic data from the Red Imported Fire Ant. Write out each DNA sequence into its own separate file. 

**Bonus:** Create a set of folders: one for sequences between 100 and 199 bases long, one for sequences between 200 and 299 bases long, etc. Write out each DNA sequence in the input file to a separate file in the appropriate folder.


In [1]:
?open