# Dataframes and Functions

In this worksheet, we will start to explore data. The focus here will be on reading in data files and learning the basic syntax of functions: how to decipher them and how to write them. We have already seen a couple of functions, re.search() and re.findall().

The following functions will help us to describe and better understand our dataset. We will import our libraries, define a path to load the dataset, and then run some basic functions to describe the data.

### Goals
1. Import your own data as a dataframe.
2. Filter and sort your data using functions like .query().
3. Use for loops and if-else statements to manage your data.

## Importing Libraries and Defining a Path Variable

We have already imported one library (re). This will be a similar process but we'll introduce another feature.

In [8]:
import re
import pandas as pd
# To install pandas with conda, open a new terminal and enter, "conda install -c anaconda pandas"

Here, we're not only importing a library called "pandas," we're also renaming it as "pd" for this notebook. Much like we had to identify functions by their origin library (re.search, re.findall), we also need to identify pandas functions by their library. Rather than writing pandas.function_name, we can simply write pd.function_name. Many of the functions further down are called this way.
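
As a quick sketch of what the "as pd" renaming does (the small dataframes here are made up for illustration), both names point to the same library, so either prefix calls the same function:

```python
import pandas
import pandas as pd

# Both names point to the same library; "pd" is only a shorter label.
same_library = pd is pandas

# So these two lines call exactly the same function.
long_form = pandas.DataFrame({"a": [1, 2]})
short_form = pd.DataFrame({"a": [1, 2]})

print(same_library)
print(long_form.equals(short_form))
```

In practice you only need the second import line; the first is here just for the comparison.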

With our libraries imported, we will now define a string variable that will be useful for loading data and working with file paths.

In [9]:
abs_dir = "/Users/williamquinn/Desktop/DH/Python/Teaching/Python-Notebooks/"
# abs_dir is an arbitrary variable name that I use for "absolute directory."
# This addition is not necessary but is a quality of life change.
# Rather than writing this file path every time, I can use this string variable instead.

As you'll notice, this file path is specific to my computer. Your computer likely won't have a "Users/williamquinn/" directory. You should change the above string to reflect your own file path. If you're unsure what that is, you can enter the code below which will tell you the file path of your working directory (where you are on your computer when running Python).

In [3]:
import os

os.getcwd()

'/Users/williamquinn/Desktop/DH/Python/Teaching/Python-Notebooks'

We first have to import os, which is part of Python's standard library. You should NOT need to install this with conda. You'll also notice that "abs_dir" is almost identical to the working directory. The only difference is that I've added a slash ("/") to the end of abs_dir.
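
If keeping track of that trailing slash feels fragile, one alternative (not the approach used in the rest of this notebook) is to let os.path.join build the path. This is a minimal sketch that assumes the data folder sits inside your working directory, which may not be true on your computer:

```python
import os

# os.getcwd() returns the folder you are currently working in,
# without a trailing slash.
working_dir = os.getcwd()

# os.path.join inserts the separators for us, so a missing or extra
# trailing slash in the folder name no longer matters.
csv_path = os.path.join(working_dir, "data", "fake-and-real-news-dataset", "True.csv")

print(csv_path)
```

The resulting string can be handed to pd.read_csv in the same way as abs_dir + "data/...".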


## Read in Data with Pandas (pd.)

With these libraries and a file path variable, we can now load data and create a variable.

In [10]:
data = pd.read_csv(abs_dir + "data/fake-and-real-news-dataset/True.csv", sep = ",")

data.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


Like other functions we've seen so far, the pd.read_csv function has some familiar attributes. pd.read_csv will create a dataframe, which is what we'll use to process data. We have two arguments: the file path to "True.csv" and something called "sep."

```python
abs_dir + "data/fake-and-real-news-dataset/True.csv"
```

The first argument, unsurprisingly, tells pd.read_csv where to look. We are taking abs_dir, which is a string variable, and using + to attach the rest of the file path to it.

``` python
sep = ','
```
The second argument is short for "separation." sep tells pd.read_csv how to make columns. In this case, it means: create a new column whenever there is a (non-escaped) comma. A character that separates values is called a delimiter. We could change the comma to a tab ("\t") or another character we choose.

You might also know that the "csv" in read_csv is short for comma-separated-values. We can still use pd.read_csv to open files separated by other delimiters. A .csv file can still use tabs to separate values.
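
To see a different delimiter in action, here is a small sketch that reads a made-up, tab-separated table. It uses io.StringIO so that the text can stand in for a file; the titles and subjects are invented for illustration.

```python
import io
import pandas as pd

# A tiny, made-up tab-separated table held in memory instead of a file.
tab_text = "title\tsubject\nFirst headline\tpoliticsNews\nSecond headline\tworldnews\n"

# sep="\t" tells pd.read_csv to start a new column at every tab.
tab_data = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(tab_data.shape)          # (2, 2)
print(list(tab_data.columns))  # ['title', 'subject']
```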

```python
data.head()
```

.head() is a pandas dataframe function (the parentheses are needed to run it) that provides a glimpse at the data rather than loading every row into the output. Jupyter Notebook will limit how many rows to print in the output automatically, but .head() limits it to the first five rows.
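
Five rows is only the default. As a small sketch on a made-up dataframe, you can put a number inside the parentheses to ask for a different number of rows:

```python
import pandas as pd

# A made-up dataframe with ten rows, standing in for a larger dataset.
numbers = pd.DataFrame({"n": range(10)})

first_five = numbers.head()   # default: the first five rows
first_two = numbers.head(2)   # ask for a different number of rows

print(len(first_five), len(first_two))  # 5 2
```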

### Adding Values to Dataframes

As we can see, this data has the necessary columns for text analysis: text (obviously) but also metadata about that text, like date, subject, and title. But, how do we know that these texts are true (or, perhaps more accurately, professional journalism)? Other than the file name, which is True.csv, the data itself does not have a category that indicates its veracity.

Before we combine the real news with fake news, we might want to add a column that indicates what the file title tells us. While the reasons for adding metadata that classifies these texts will be more apparent later, for now it is useful to know how to interact with dataframes.

In [11]:
data['veracity'] = "real"

data.head()

Unnamed: 0,title,text,subject,date,veracity
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",real
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",real
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",real
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",real
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",real


As we can now see, we've added a new column called "veracity" and filled every row in the column with a string value, "real."

You may also have guessed what the following syntax means:

```python
data['veracity']
```

"data" is our dataframe. The square brackets after a named dataframe access a specific column. We then make all the values in that column the same value.

We can run functions directly on a specific column rather than the entire dataframe using the same syntax.

In [6]:
data['subject'].unique()

array(['politicsNews', 'worldnews'], dtype=object)

The .unique() function above searches through the entire "subject" column for unique values. This function can be useful if you ever need to debug broken code. I often find that .unique() helps when I mistype a value.
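
Here is a sketch of that debugging use on a made-up dataframe. The typo in the third row is deliberate, and the dataframe name "toy" is just for illustration:

```python
import pandas as pd

# Made-up rows; the third subject has a deliberate typo.
toy = pd.DataFrame({"subject": ["politicsNews", "worldnews", "politcsNews"]})

# Square brackets select one column; assigning a single value fills every row.
toy["veracity"] = "real"

# The typo shows up as its own "unique" value, which makes it easy to spot.
subjects = toy["subject"].unique()
print(subjects)
```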

## Joining Dataframes

The "True.csv" file represents only half of our data, though. Before we start cleaning the text, we should first load "Fake.csv" and join it to our real data.

In [12]:
# We are creating a new variable for our data called "fake" with pd.read_csv.
fake = pd.read_csv(abs_dir + "data/fake-and-real-news-dataset/Fake.csv", sep = ",")

# Assuming the columns are the same in both files, 
# we'll want to add another column value to indicate this is fake news
fake['veracity'] = 'fake'

# We will now override our "data" variable by joining the original "data" with our "fake" data.
data = pd.concat([data, fake], axis=0, sort=False)

print (data.shape)
data.head()

(44898, 5)


Unnamed: 0,title,text,subject,date,veracity
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",real
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",real
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",real
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",real
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",real


There are a few new concepts in the code above.

```python
data = pd.concat([data, fake], axis=0, sort=False)
```
#### pd.concat

So far, pd.concat, short for concatenate, behaves as we would expect.

#### [data, fake]

The first argument of pd.concat, however, is a list. The list contains our two dataframes so far. These two items within the list are what we want to join together. We can use "data" in two places because assigning the result to a variable (data = ) overwrites the original "data" dataframe we created.

#### axis=0

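The second argument of the function, axis=0, is new. The "axis" argument is common in pandas functions. axis=0 tells pd.concat to work along the rows: the rows of "fake" are stacked underneath the rows of "data." (axis=1 would instead place the dataframes side by side, adding columns.)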
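
A small sketch can make the axis argument more concrete. The two dataframes below are made up and far smaller than our news data; comparing the shapes shows which direction each axis value grows the result.

```python
import pandas as pd

# Two made-up dataframes with the same columns.
real = pd.DataFrame({"title": ["a", "b"], "veracity": ["real", "real"]})
fake = pd.DataFrame({"title": ["c", "d"], "veracity": ["fake", "fake"]})

stacked = pd.concat([real, fake], axis=0)       # rows added underneath
side_by_side = pd.concat([real, fake], axis=1)  # columns added to the right

print(stacked.shape)        # (4, 2)
print(side_by_side.shape)   # (2, 4)
print(list(stacked.index))  # [0, 1, 0, 1]
```

Notice that the stacked result keeps each dataframe's original row labels, so the labels 0 and 1 appear twice.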

#### sort=False

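Notice that False is a different color. In Python, True and False are boolean values. Here, sort=False tells pd.concat NOT to sort the column names alphabetically when the dataframes' columns don't line up; the order of the rows is left alone either way.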
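
To see what sort changes in practice, here is a sketch with two made-up dataframes whose columns differ. The rows come out in the same order either way; only the order of the column names changes.

```python
import pandas as pd

# Made-up dataframes whose columns only partly overlap.
first = pd.DataFrame({"title": ["a"], "date": ["December 31, 2017"]})
second = pd.DataFrame({"date": ["December 30, 2017"], "author": ["x"]})

unsorted_cols = pd.concat([first, second], axis=0, sort=False)
sorted_cols = pd.concat([first, second], axis=0, sort=True)

print(list(unsorted_cols.columns))  # ['title', 'date', 'author']
print(list(sorted_cols.columns))    # ['author', 'date', 'title']
```

Our True.csv and Fake.csv files share the same columns, so for this dataset the setting makes no visible difference.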

#### print (data.shape)

Looking back at the output of the cell, you might notice an additional line above the table: (44898, 5). This is the result of our print function. It means the shape of "data" is 44,898 rows by 5 columns. print() allows you to print more than one result in the output of the cell.

You might also notice that data.shape is peculiar. It looks like a function but it does not have parentheses. That is because .shape is an attribute (a value stored on the dataframe) rather than a function, so there is nothing to run. Details like this are a good reminder that we don't necessarily need to be computer scientists to use Python. Often, if you encounter an error, someone else has encountered the same one and found a solution online. (As you'll likely find out, translating solutions from online sources can be its own skillset, one that you'll acquire gradually over time.)
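
As a brief sketch of working with .shape (again on a made-up dataframe), the pair of numbers it gives back can be split into two variables:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b", "c"], "veracity": ["real", "fake", "fake"]})

# .shape gives back a pair of numbers: (rows, columns).
n_rows, n_cols = toy.shape

print(n_rows, "rows and", n_cols, "columns")  # 3 rows and 2 columns
```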

In the cell below, 
1. Use the .unique() function and the print function at least once to print the unique values of data['subject'] and data['veracity'].
2. Try outputting data.shape(). It won't work, but it can be useful to see and (over time) become familiar with deciphering errors. 

In [None]:
# Print the unique values of data['subject']

# Print the unique values of data['veracity']

# Produce an error by trying data.shape()
# This should be the last line you run because an error will interrupt the entire cell whenever it appears.


## Using conditions to inspect dataframes

The last thing we'll learn in this notebook is to inspect the results of dataframes that fulfill specified conditions. Conditions are parameters that we can set to limit data. For example, even though we've joined both True.csv and Fake.csv, we can effectively separate them again by setting conditions that must be true to be included in the new dataset.

In [9]:
fake = data.query('veracity == "fake"')

print (fake.shape)
fake.head()

(23481, 5)


Unnamed: 0,title,text,subject,date,veracity
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",fake


We've re-created the fake dataframe by using a new function, .query.

There are a handful of conditions that you might recognize from math classes:

* Equals: ==
* Not equals: != 
* Less than: <
* Less than or equal to: <=
* Greater than: >
* Greater than or equal to: >=

However,
```python
data.query('veracity == "fake"')
```
isn't working with numbers. Instead, .query is asking each row: does your "veracity" column equal the string "fake"? If the answer is yes, then that row gets to stay.

You should also notice the use of quotation marks. Because the whole .query condition is written inside single quotation marks, any string value within it has to be in double quotation marks.
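
The sketch below tries a few of these conditions on a made-up dataframe. The "word_count" column is hypothetical (our news data does not have one); it is there only to show a numeric comparison.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "veracity": ["real", "fake", "fake", "real"],
    "word_count": [120, 45, 300, 80],  # hypothetical column for illustration
})

fake_rows = toy.query('veracity == "fake"')   # string comparison
long_rows = toy.query('word_count >= 100')    # numeric comparison
not_fake = toy.query("veracity != 'fake'")    # quotation marks swapped

print(fake_rows.shape, long_rows.shape, not_fake.shape)  # (2, 3) (2, 3) (2, 3)
```

The third line shows that the quotation marks can be swapped, as long as the outer and inner pairs differ.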



## Loops

What if you want to further specify and limit your corpus? Python has built-in statements, like for loops and if-else statements, that allow you to write your own procedures...

What if you want all the news articles (fake and real) from December only?

### For Loops

In [10]:
# Before we attempt a loop on our dataset, let's practice on a simple list.
our_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# For Loop
# A for loop will visit every distinct item within a list or other object, one at a time.
for i in our_list:
#     Notice, the colon and indentation. 
#     The first line of a loop must end with a colon.
#     The following line must be indented, either a tab or 4 spaces.
    print (i)
    
print (our_list)

1
2
3
4
5
6
7
8
9
10
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


The for loop we just created broke down "our_list" into individual units, what we've called "i" ("i" could be any name you choose). Python recognizes that the structure of "our_list" is a list, and the for loop should behave in a way that handles each item individually. The two print functions (print (i) and print (our_list)) should illustrate how the for loop operates differently than simply printing the entire list.

You might also notice "in" there. A for loop looks for units within an object. The syntax of "in" tells the for loop where to look.

## If-Else Statements

Perhaps we don't want to print every single item within a list. Instead, we want to print only the integer 7. We can add an "if statement" to our for loop to further constrain the operation.

In [11]:
for i in our_list:
    if i==7:
        print (i)
    else:
        print ("This number is not 7")

This number is not 7
This number is not 7
This number is not 7
This number is not 7
This number is not 7
This number is not 7
7
This number is not 7
This number is not 7
This number is not 7


The if-else statement follows the same construction as the for loop. Here, we are telling the for loop to check if an item in our_list equals 7 (i==7). If it does not equal 7 (i!=7), then the loop should print: "This number is not 7." If i does equal 7, then it should print "i," which is 7.
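
Printing is not the only thing a loop can do. As a minimal sketch, the same pattern can collect the items that meet a condition into a new list. The name "big_numbers" and the cutoff of 7 are arbitrary choices for illustration.

```python
our_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Start with an empty list to store the items we want to keep.
big_numbers = []

for i in our_list:
    if i > 7:
        # .append adds the current item to the end of big_numbers.
        big_numbers.append(i)
    else:
        # pass means "do nothing" for items that fail the condition.
        pass

print(big_numbers)  # [8, 9, 10]
```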

These fundamental loops can be very effective when trying to interact with your dataset. With for loops and if-else statements, we can reduce our dataset to news (fake and real) published in the month of December.

In [12]:
# First, we'll create an empty dataframe to store new rows as we come across them.
new_data = pd.DataFrame()

# Next, we'll write a "for loop."
#     data.iterrows() takes our "data" dataframe and makes it iterable. That is, it allows a for loop to read it.
#     .iterrows() also requires two variables. We have "index" and "row" here
#     But it could just as well be "i", "r"
for index, row in data.iterrows():

#     Here, we will check for the string "December" in the date column of each row.
    if "December" in row['date']:
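#         The last step is appending each row that meets our if statement to new_data.
#         (DataFrame.append was removed in pandas 2.0, so we join with pd.concat instead.)
        new_data = pd.concat([new_data, row.to_frame().T])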

#     This else statement tells the for loop to continue (basically ignore) rows that don't meet the if condition.
    else:
        pass
    

print (new_data.shape)
new_data.head()

(3538, 5)


Unnamed: 0,title,text,subject,date,veracity
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",real
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",real
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",real
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",real
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",real


The code above is more complex than we've seen so far, but we can make sense of it when we think about it in simpler steps.

1. Create a new, empty dataframe.
2. Create a for loop to iterate through each row (and index) of our old dataframe.
3. Check the 'date' column of that row for the string "December."
4. Append the entire row to the new dataframe if the if condition is met.

When creating more complicated functions, it can be useful to go step-by-step.
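
Once the loop makes sense, it may be worth knowing that pandas can often do the same filtering without a loop. This is an alternative sketch, not the method used above, shown on a made-up dataframe. It uses the pandas function .str.contains, which checks every value in a column at once.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": ["December 31, 2017", "November 2, 2017", "December 1, 2016"],
})

# One True/False answer per row: does the date contain "December"?
# na=False treats missing dates as "no" instead of raising an error.
is_december = toy["date"].str.contains("December", na=False)

# Keep only the rows where the answer is True.
december_only = toy[is_december]

print(december_only.shape)  # (2, 2)
```

On a dataset with tens of thousands of rows, this version will usually run much faster than going row by row.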

## Conclusion

In this notebook, we covered a lot of foundational operations in Python. While you might be able to copy and paste certain functions (pd.read_csv), other constructs (for, if, in) require more practice. Again, the goal of these notebooks, or of learning code in general, is NOT to memorize as much as you can. It's far more effective to learn how to find solutions online. Hopefully, though, this particular notebook might work as a reference guide for you in case you forget how to read a csv file or use a for loop.

As we'll see in later notebooks, the kind of coding we'll be doing relies on using tried and true methods. This notebook will hopefully provide you with a model for how to:
1. Import your own data as a dataframe.
2. Filter and sort your data using functions like .query().
3. Use for loops and if-else statements to manage your data.

## Exercises

Looking at the text column of our dataframe ('data'), we can see a recurring pattern: many news articles begin with a place of publication and the publisher. Can you use a regex function to remove that information? Better yet, can you move that information out of the text column and into a new column?