# Working with columns

### Introduction

In the last lesson, we learned about loading data into a pandas dataframe and then selecting rows.  But a lot of what we'll be doing when exploring data is selecting columns.

For example, in the movies dataset that we've been working with, there are many columns that we simply don't need, and then there are other columns that we may want to explore more closely.  Ok, let's see how.

### Exploring Columns

We'll start by loading up our data once again.

In [2]:
import pandas as pd
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
df = pd.read_csv(url)
df[:1]

| | year | imdb | title | test | clean_test | binary | budget | domgross | intgross | code | budget_2013\$ | domgross_2013\$ | intgross_2013\$ | period code | decade code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | tt1711425 | 21 & Over | notalk | notalk | FAIL | 13000000 | 25682380.0 | 42195766.0 | 2013FAIL | 13000000 | 25682380.0 | 42195766.0 | 1.0 | 1.0 |


Now we can see a list of all of the columns in our dataframe with the `columns` attribute.

In [6]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

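Because `df.columns` is an `Index` object, it can be converted to a plain list or checked for a particular name. Below is a minimal sketch, assuming a small hand-built dataframe (`df_small` is a hypothetical stand-in with a few values copied from the rows shown in this lesson) so that it runs without downloading the file:

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe (three columns only)
df_small = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'title': ['21 & Over', 'Dredd 3D', '12 Years a Slave'],
    'budget': [13000000, 45000000, 20000000],
})

column_names = list(df_small.columns)   # plain Python list of column names
has_year = 'year' in df_small.columns   # membership check against the Index

print(column_names)
print(has_year)
```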
### Selecting a single column

From there, we can select a specific column by using the bracket accessor.

In [7]:
df['year']

0       2013
1       2012
2       2013
3       2013
4       2013
        ... 
1789    1971
1790    1971
1791    1971
1792    1971
1793    1970
Name: year, Length: 1794, dtype: int64

Now another way to select a specific column is with the dot notation.

In [8]:
df.year

0       2013
1       2012
2       2013
3       2013
4       2013
        ... 
1789    1971
1790    1971
1791    1971
1792    1971
1793    1970
Name: year, Length: 1794, dtype: int64

We will often assign the item we want to predict, our target, to the variable `y`.

Let's assign the column `domgross` to the variable `y`.

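That assignment might look like the sketch below. It assumes a small stand-in dataframe (`df_small` is hypothetical, with values copied from the rows shown in this lesson); with the full movies data the equivalent line would be `y = df['domgross']`.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe
df_small = pd.DataFrame({
    'title': ['21 & Over', 'Dredd 3D', '12 Years a Slave'],
    'budget': [13000000, 45000000, 20000000],
    'domgross': [25682380.0, 13414714.0, 53107035.0],
})

# Select the target column with the bracket accessor and assign it to y
y = df_small['domgross']
print(y)
```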
Note that the dot notation *cannot* be used with some column names, such as those containing spaces, because those names are not valid Python identifiers.

In [9]:
df.decade code

SyntaxError: invalid syntax (<ipython-input-9-c600b1b3a395>, line 1)

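The bracket accessor does not have this limitation, since the column name is passed as an ordinary string. A minimal sketch, assuming a small stand-in dataframe (`df_small` is hypothetical, and its `decade code` values are illustrative placeholders):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe
df_small = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'decade code': [1.0, 1.0, 1.0],  # illustrative placeholder values
})

# The column name contains a space, so we use brackets instead of a dot
decade = df_small['decade code']
print(decade)
```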
### Selecting multiple columns

Now let's move on to selecting multiple columns.  Once again, we'll take a look at all of our columns.

In [10]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

And now let's select the columns `year` and `title`.

In [12]:
columns = ['year', 'title']
selected_df = df[columns]
selected_df[:3]

| | year | title |
|---|---|---|
| 0 | 2013 | 21 & Over |
| 1 | 2012 | Dredd 3D |
| 2 | 2013 | 12 Years a Slave |


So we just greatly reduced the number of columns, and assigned this smaller dataframe to `selected_df`.  Let's go over how we did this.

We used the following format:

* dataframe, bracket accessors, list of columns 

```python
df[ ['col_1', 'col_2']]
```

It can be hard to keep track of all of those brackets, so it is nice to first assign the list of columns to a variable.

In [14]:
columns = ['year', 'title']
selected_df = df[columns]

selected_df[:3]

| | year | title |
|---|---|---|
| 0 | 2013 | 21 & Over |
| 1 | 2012 | Dredd 3D |
| 2 | 2013 | 12 Years a Slave |


Notice that if we select a single column from a dataframe, we are working with a series.

In [4]:
type(df['year'])

pandas.core.series.Series

But if we select multiple columns from a dataframe, we have a dataframe.

In [6]:
type(df[['year', 'title']])

pandas.core.frame.DataFrame

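What decides the result type is whether we pass a single label or a list, not how many columns come back. A minimal sketch, assuming a small stand-in dataframe (`df_small` is hypothetical, with values copied from the rows shown in this lesson), where a one-item list still produces a dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe
df_small = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'title': ['21 & Over', 'Dredd 3D', '12 Years a Slave'],
})

as_series = df_small['year']    # single label -> Series
as_frame = df_small[['year']]   # list with one label -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```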
### Dropping Columns

Just like we can select columns, we can also drop columns.

In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
df = pd.read_csv(url)

In [4]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

With the `drop` method we can provide a list of columns to drop.  The method returns to us a new dataframe with the specified columns removed.

In [13]:
df_dropped = df.drop(columns = ['period code', 'clean_test', 'binary', 'decade code', 'code'])
df_dropped[:4]

| | year | imdb | title | test | budget | domgross | intgross | budget_2013\$ | domgross_2013\$ | intgross_2013\$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | tt1711425 | 21 & Over | notalk | 13000000 | 25682380.0 | 42195766.0 | 13000000 | 25682380.0 | 42195766.0 |
| 1 | 2012 | tt1343727 | Dredd 3D | ok-disagree | 45000000 | 13414714.0 | 40868994.0 | 45658735 | 13611086.0 | 41467257.0 |
| 2 | 2013 | tt2024544 | 12 Years a Slave | notalk-disagree | 20000000 | 53107035.0 | 158607035.0 | 20000000 | 53107035.0 | 158607035.0 |
| 3 | 2013 | tt1272878 | 2 Guns | notalk | 61000000 | 75612460.0 | 132493015.0 | 61000000 | 75612460.0 | 132493015.0 |

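Since `drop` returns a new dataframe by default, the original dataframe keeps all of its columns. A minimal sketch of that behavior, assuming a small stand-in dataframe (`df_small` and `df_small_dropped` are hypothetical names, with values copied from the rows shown in this lesson):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe
df_small = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'title': ['21 & Over', 'Dredd 3D', '12 Years a Slave'],
    'test': ['notalk', 'ok-disagree', 'notalk-disagree'],
})

# drop returns a new dataframe; df_small itself is left untouched
df_small_dropped = df_small.drop(columns=['test'])

print(list(df_small_dropped.columns))  # columns of the new dataframe
print(list(df_small.columns))          # columns of the original dataframe
```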

### Summary

In this lesson, we learned how to select columns from our pandas dataframe.  We can start by seeing all of the columns with the `columns` attribute.

In [23]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

We can select a single column by using either the bracket accessors or the dot notation, and then assign that column to a variable.

In [18]:
year = df['year']

In [19]:
year = df.year

We can select multiple columns by again using the bracket accessors, this time passing in a list of the columns that we would like to select.

In [3]:
cols = ['year', 'title']
selected = df[cols]
selected[:3]

| | year | title |
|---|---|---|
| 0 | 2013 | 21 & Over |
| 1 | 2012 | Dredd 3D |
| 2 | 2013 | 12 Years a Slave |


We can also drop multiple columns with the `drop` method.

In [15]:
df_dropped = df.drop(columns = ['period code', 'clean_test', 'binary', 'decade code', 'code'])
df_dropped[:3]

| | year | imdb | title | test | budget | domgross | intgross | budget_2013\$ | domgross_2013\$ | intgross_2013\$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | tt1711425 | 21 & Over | notalk | 13000000 | 25682380.0 | 42195766.0 | 13000000 | 25682380.0 | 42195766.0 |
| 1 | 2012 | tt1343727 | Dredd 3D | ok-disagree | 45000000 | 13414714.0 | 40868994.0 | 45658735 | 13611086.0 | 41467257.0 |
| 2 | 2013 | tt2024544 | 12 Years a Slave | notalk-disagree | 20000000 | 53107035.0 | 158607035.0 | 20000000 | 53107035.0 | 158607035.0 |
