## Basics of Data Handling:

- Reading Files with Open
- Writing Files with Open
- Loading Data with Pandas
- Pandas: Working with and Saving Data
- One Dimensional Numpy
- Two Dimensional Numpy

### Data Structures:
- Series: A one-dimensional sequence of data values, each identified by an index label.
- DataFrame: A two-dimensional structure of data values, each identified by a pair (r, c). Usually the row r is an index number and the column c is a name.
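
To make the two definitions above concrete, here is a minimal sketch in pandas. The values, index labels, and column names are made up for illustration and do not come from any dataset in these notes.

```python
import pandas as pd

# A Series: one-dimensional values, each labeled by an index entry.
# (Illustrative values and labels only.)
ages = pd.Series([39, 50, 38], index=["a", "b", "c"])

# A DataFrame: two-dimensional; rows carry index numbers, columns carry names.
people = pd.DataFrame(
    {"age": [39, 50, 38], "education": ["Bachelors", "Bachelors", "HS-grad"]}
)

print(ages["b"])             # look up a Series value by its index label
print(people.loc[1, "age"])  # look up a DataFrame value by (row, column)
```

Here `people.loc[1, "age"]` is the (r, c) lookup described above, with r the default integer row index and c the column name.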

Let's read a data frame provided as a .csv file. Here is the code:

    import pandas as pd

    pd.read_csv('../data/adult.data', delimiter=',')

As written, this call fails because the relative path does not point to an existing file:

    FileNotFoundError: File b'../data/adult.data' does not exist
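
One way to avoid this kind of error is to check the path before loading. The sketch below is only an illustration: `read_csv_checked` is a hypothetical helper, and the small CSV it reads is a throwaway file created on the spot, not the `adult.data` file referenced above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_checked(path):
    """Hypothetical helper: read a CSV, failing early with the resolved path."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No file at {path.resolve()}")
    return pd.read_csv(path, delimiter=",")


# Demo on a temporary file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("age,education\n39,Bachelors\n50,HS-grad\n")
    df = read_csv_checked(sample)

print(df.shape)
```

Printing the resolved absolute path in the error message usually makes it clear whether the notebook's working directory is what you assumed.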




- Python Demonstration: Reading and Writing CSV files
- Python Dates and Times
- Advanced Python Objects, map()
- Advanced Python Lambda and List Comprehensions
- Advanced Python Demonstration: The Numerical Python Library (NumPy)


In this week of the course you'll learn the fundamentals of one of the most important toolkits Python has for data cleaning and processing: pandas. You'll learn how to read data into DataFrame structures, how to query these structures, and how such structures are indexed.

- The Series Data Structure
- Querying a Series
- The DataFrame Data Structure
- DataFrame Indexing and Loading
- Querying a DataFrame
- Indexing Dataframes
- Missing Values

In this week you'll deepen your understanding of the Python pandas library by learning how to merge DataFrames, generate summary tables, group data into logical pieces, and manipulate dates. We'll also refresh your understanding of scales of data, and discuss issues with creating metrics for analysis.

- Merging DataFrames
- Pandas Idioms
- Group by
- Scales
- Pivot Tables
- Date Functionality

In this week of the course you'll be introduced to a variety of statistical techniques such as distributions, sampling, and t-tests. The majority of the week will be dedicated to your course project, where you'll engage in a real-world data cleaning activity and provide evidence for (or against!) a given hypothesis. This project is suitable for a data science portfolio, and will test your knowledge of cleaning, merging, manipulating, and testing for significance in data. The week ends with two discussions of science and the rise of the fourth paradigm: data-driven discovery.

Core concept of distributions: consider flipping a coin. There is a probability of it landing heads up and a probability of it landing tails up. If we flip a coin many times, we collect a number of measurements of the heads and tails that landed face up, and intuitively we know that the number of heads will be about equal to the number of tails for a fair coin. If you flipped a coin a hundred times and got heads each time, you'd probably doubt the fairness of that coin. We can consider the result of each flip of this coin as a random variable.

When we consider the set of all possible outcomes of a random variable together with their probabilities, we call this a distribution. In this case the distribution is called binomial, since there are two possible outcomes, heads or tails. It's also an example of a discrete distribution, since the outcomes are categories (heads and tails) rather than real numbers.

NumPy actually has some distributions built into it, allowing us to make random flips of a coin with given parameters. Let's give it a try.

Here we ask for a number from the NumPy binomial distribution. We have two parameters to pass in. The first is the number of trials we want it to run. The second is the chance of success on each trial, that is, the chance we get a one, which we will use to represent heads here. Let's run one round of this simulation.

Great, so if you're following along in a Jupyter notebook, you got either a zero or a one, and about half of you got a value that agreed with the one I got.
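
The single flip described above can be sketched as follows. This is a minimal reconstruction from the description, not the notebook cell itself; the variable name `flip` is an assumption.

```python
import numpy as np

# One trial with success probability 0.5.
# The result is the number of successes in that single trial: 0 or 1.
flip = np.random.binomial(1, 0.5)
print(flip)
```

No seed is set, so the value changes from run to run.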

What if we run the simulation a thousand times and divide the result by a thousand? You see a number pretty close to 0.5, which means that about half of the time we had heads and half of the time we had tails.
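
A sketch of the thousand-flip version, assuming the same NumPy call with the number of trials raised to 1000. The names `n_flips`, `heads`, and `fraction` are illustrative.

```python
import numpy as np

n_flips = 1000

# Number of successes (ones) across 1000 fair trials.
heads = np.random.binomial(n_flips, 0.5)

# Dividing by the number of trials gives the observed fraction of ones.
fraction = heads / n_flips
print(fraction)
```

The printed fraction varies between runs but should stay near 0.5.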

Of course, an evenly weighted binomial distribution is only one simple example. We can also have unevenly weighted binomial distributions. For instance, what's the chance that we'll have a tornado today while I'm filming? It's pretty low, even though we do get tornadoes here. So maybe there's a hundredth of a percent chance. We can put this into a binomial distribution as a weighting in NumPy. If we run this 100,000 times, we see there are pretty minimal tornado events.
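
The unevenly weighted case might look like the sketch below, which takes "a hundredth of a percent" literally as 0.01 / 100. The variable names are assumptions.

```python
import numpy as np

chance_of_tornado = 0.01 / 100  # a hundredth of a percent, i.e. 0.0001
n_days = 100_000

# Number of tornado days out of 100,000 simulated days.
tornado_days = np.random.binomial(n_days, chance_of_tornado)
print(tornado_days)
```

With these parameters the expected count is 100,000 × 0.0001 = 10, so individual runs should land somewhere around that value.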

So you might be wondering why I'm talking about such a simple and intuitive distribution. I mean, we all understand flipping a coin from when we had to make important decisions as children.

But what I want to demonstrate is that computational tools are starting to allow us to simulate the world, which helps us answer questions. I could have shown you the math behind this so we could have worked out the probabilities, but a simulation is essentially another form of inquiry. Let's take one more example. Let's say the chance of a tornado here in Ann Arbor on any given day is 1%, regardless of the time of year. That's higher than realistic, but it makes for a quicker demo. And let's say that if there's a tornado, I'm going to get away from the windows and hide, then come back and do my recording the next day. So what's the chance of this happening two days in a row?

Well, we can use the binomial distribution in NumPy to simulate this.

Here we create an empty list, and we create a number of potential tornado events by asking the NumPy binomial function using our chance of tornado. We'll do this a million times, which is just shy of 3,000 years' worth of events (about 2,740 years of daily draws).

This process is called sampling the distribution.

Now we can write a little loop to go through the list and look for any adjacent pair of ones, which means that there were back-to-back tornado days. We see that this ends up being roughly 100 two-day tornado events over the roughly 3,000 years, which frankly is still too many for me. My point here, though, is that modern computational power allows us to very quickly simulate the effects of different parameters in a distribution, leading to a new way of understanding the problem. You don't have to work out all the math; you can quite often simulate the problem instead and observe the results.
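
The sampling and the loop described above can be sketched as follows. This is a reconstruction under assumptions: it draws all million days in one call using the `size` argument of `np.random.binomial` rather than filling an empty list, and the variable names are illustrative.

```python
import numpy as np

chance_of_tornado = 0.01
n_days = 1_000_000

# Sample the distribution: one 0/1 draw per day, where 1 means a tornado.
tornado_events = np.random.binomial(1, chance_of_tornado, n_days)

# Count the days on which both that day and the previous day had a tornado.
two_days_in_a_row = 0
for j in range(1, len(tornado_events)):
    if tornado_events[j] == 1 and tornado_events[j - 1] == 1:
        two_days_in_a_row += 1

years = n_days / 365
print(f"{two_days_in_a_row} back-to-back tornado days in {years:.1f} years")
```

As a sanity check on the simulation, each adjacent pair of days is a double tornado with probability 0.01 × 0.01 = 0.0001, and there are about a million adjacent pairs, so the count should come out near 100.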

In the next lecture we'll talk a little bit more about distributions.