
# What is Data Science?

Data Science is an interdisciplinary field that aims to extract knowledge and insights from large and messy data. The field draws on Computer Science, Statistics, and knowledge of an application domain.



![alternative text](https://live.staticflickr.com/65535/53229478936_4fd464b063_h.jpg)

In order to uncover useful intelligence for their organizations, data scientists must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns at each phase of the process.



![alternative text](https://live.staticflickr.com/65535/53229979195_4acbf4ff2e_h.jpg)

The Data Science life cycle contains several phases. First, data is gathered from various sources (provided as CSV or JSON files, or retrieved from an API). Data scientists then clean the dataset and visualize the patterns and trends within it. Next, models are constructed to predict or forecast trends. Finally, the knowledge extracted is used to improve the prediction models.

We will be using the pandas library to read in, clean, visualize, and analyze datasets. pandas is a package for efficient manipulation of datasets. Each dataset is managed as a 'DataFrame', which is essentially a spreadsheet whose rows and columns we can access individually. pandas implements a number of powerful operations for working with such tables.



First, we need to import pandas:

In [23]:
import pandas as pd

What does data look like? For most people, the first image that comes to mind is a spreadsheet, with numbers neatly arranged in a table of rows and columns. One goal of this book is to get you to think beyond tables of numbers---to recognize that the words in a book and the markers on a map are also data to be collected, processed, and analyzed. But a lot of data is still organized into tables, so it is important to know how to work with tabular data.

Let's look at a tabular data set. Shown below are the first 5 rows of a data set about the passengers on the Titanic. This data set contains information about each passenger (e.g., name, sex, age), their journey (e.g., the fare they paid, the port where they embarked), and their ultimate fate (whether they survived or not).

In a tabular data set, each row represents a distinct observation and each column a distinct variable. Each observation is an entity being measured, and variables are the attributes we measure. In the Titanic data set, each row represents a passenger on the Titanic. For each passenger, 12 variables have been recorded, including Pclass (their ticket class: 1, 2, or 3) and Embarked (a code for the port where they boarded). The extra column Unnamed: 0 holds only the row number.

In Python, the pandas library provides a convenient data structure for storing tabular data, called the DataFrame.
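To make the row/column vocabulary concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The values are taken from three of the Titanic passengers shown in this chapter; the variable name `sample` is only an illustration and is not used elsewhere.

```python
import pandas as pd

# Three observations (passengers) and four variables (columns).
# `sample` is a hypothetical name used only for this illustration.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 26.0, 35.0],
    "Fare": [7.25, 7.925, 8.05],
})

# .shape is a (number of rows, number of columns) tuple,
# i.e. (observations, variables).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
print(list(sample.columns))
```

Each key of the dictionary becomes a column, and each position in the lists becomes a row.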

In [24]:

titanic = pd.read_csv('Titanic.csv')

In [25]:
titanic.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In a DataFrame, each observation is identified by an index. You can determine the index of a DataFrame by looking for the bolded values at the beginning of each row when you print the DataFrame. For example, notice how the numbers 0, 1, 2, 3, 4, ... above are bolded, which means that this DataFrame is indexed by integers starting from 0. This is the default index when you read in a data set from disk into pandas, unless you explicitly specify otherwise.
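The sketch below illustrates the default index and one way to "explicitly specify otherwise". It assumes, as the Unnamed: 0 column suggests, that the CSV file was saved together with its row numbers; the text in `csv_text` is a hypothetical miniature of such a file, not the real Titanic.csv.

```python
import io
import pandas as pd

# Hypothetical miniature of a CSV saved with its row numbers: the header
# starts with an empty field, so the first column has no name.
csv_text = """,PassengerId,Survived,Name
0,1,0,"Braund, Mr. Owen Harris"
2,3,1,"Heikkinen, Miss. Laina"
4,5,0,"Allen, Mr. William Henry"
"""

# Default behaviour: pandas numbers the rows 0, 1, 2, ... and keeps the
# nameless first column as ordinary data called "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))
print(list(default.index))
print(list(default.columns))

# index_col=0 tells read_csv to use the first column as the row index instead.
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(labeled.index))
print(list(labeled.columns))
```

`io.StringIO` only stands in for a file on disk here; with a real file you would pass the file name as in the cell above.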

Since each row represents one passenger, it might be useful to re-index the rows by the name of the passenger. To do this, we call the .set_index() method of the DataFrame, passing in the name of the column we want to use as the index. After this call, Name appears at the very left when the DataFrame is printed, and the passengers' names are all bolded. This is how you know that Name is the index of this DataFrame.



In [28]:
titanic.set_index("Name", inplace=True)

Now that we have set the (row) index of the DataFrame to be the passengers' names, we can use the index to select specific passengers. To do this, we use the .loc selector. The .loc selector takes in a label and returns the row(s) corresponding to that index label.

For example, if we wanted to find the data for the infant son of the Allison family, we would pass in the label "Allison, Master. Hudson Trevor" to .loc. Notice that .loc is followed by square brackets, not parentheses.

In [30]:
titanic.loc["Allison, Master. Hudson Trevor"]

PassengerId        306
Survived             1
Pclass               1
Sex               male
Age               0.92
SibSp                1
Parch                2
Ticket          113781
Fare            151.55
Cabin          C22 C26
Embarked             S
Name: Allison, Master. Hudson Trevor, dtype: object
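As a small, self-contained sketch of label-based selection, the example below uses three passengers from the table in this chapter. It also shows .iloc, which selects by 0-based position rather than by label. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical three-row sample built from values shown in this chapter.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Age": [22.0, 26.0, 35.0],
    "Fare": [7.25, 7.925, 8.05],
})

# Without inplace=True, set_index returns a new DataFrame and leaves
# `sample` itself unchanged.
by_name = sample.set_index("Name")

row_by_label = by_name.loc["Heikkinen, Miss. Laina"]  # select by index label
row_by_position = by_name.iloc[1]                     # select by 0-based position

print(row_by_label)
print(row_by_label.equals(row_by_position))
```

Both selections return the same row as a Series whose labels are the remaining column names.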

Let's say we only want to look at the Age column. There are three ways to select it from the DataFrame.

Use .loc, specifying both the rows and columns. (Note: The colon : is Python shorthand for "all".)

In [33]:
titanic.loc[:, "Age"]

Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
                                                       ... 
Montvila, Rev. Juozas                                  27.0
Graham, Miss. Margaret Edith                           19.0
Johnston, Miss. Catherine Helen "Carrie"                NaN
Behr, Mr. Karl Howell                                  26.0
Dooley, Mr. Patrick                                    32.0
Name: Age, Length: 891, dtype: float64

Use key access, passing the column name in square brackets.

In [35]:
titanic["Age"]

Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
                                                       ... 
Montvila, Rev. Juozas                                  27.0
Graham, Miss. Margaret Edith                           19.0
Johnston, Miss. Catherine Helen "Carrie"                NaN
Behr, Mr. Karl Howell                                  26.0
Dooley, Mr. Patrick                                    32.0
Name: Age, Length: 891, dtype: float64

Use attribute access, writing the column name after a dot.

In [36]:
titanic.Age

Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
                                                       ... 
Montvila, Rev. Juozas                                  27.0
Graham, Miss. Margaret Edith                           19.0
Johnston, Miss. Catherine Helen "Carrie"                NaN
Behr, Mr. Karl Howell                                  26.0
Dooley, Mr. Patrick                                    32.0
Name: Age, Length: 891, dtype: float64
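The following sketch checks that the three forms return the same Series, using a hypothetical three-row sample. It also hints at one practical difference: attribute access needs the column name to be a valid Python identifier, so a name containing a space (the renamed column below is made up for illustration) can only be reached with key access or .loc.

```python
import pandas as pd

# Hypothetical sample indexed by passenger name.
sample = pd.DataFrame(
    {"Age": [22.0, 26.0, 35.0], "Fare": [7.25, 7.925, 8.05]},
    index=["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
)

via_loc = sample.loc[:, "Age"]   # .loc with all rows and one column
via_key = sample["Age"]          # key access
via_attr = sample.Age            # attribute access

print(via_loc.equals(via_key) and via_key.equals(via_attr))

# "Ticket Fare" is an invented column name containing a space.
# Writing renamed.Ticket Fare would be a syntax error, but key access works.
renamed = sample.rename(columns={"Fare": "Ticket Fare"})
print(renamed["Ticket Fare"].max())
```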

There is a fundamental difference between variables like Age and Fare, which can be measured on a numeric scale, and variables like Sex and Embarked, which cannot.

Variables that can be measured on a numeric scale are called quantitative variables. Just because a variable happens to contain numbers does not necessarily make it "quantitative". For example, consider the variable Survived in the Titanic data set. Each passenger either survived or didn't. This data set happens to use 1 for "survived" and 0 for "died", but these numbers do not reflect an underlying numeric scale.

Variables that are not quantitative but take on a limited set of values are called categorical variables. For example, the variable Sex takes on one of two possible values ("female" or "male"), so it is a categorical variable. So is the variable Embarked, which takes on a small set of port codes (such as "S" and "C"). We call each possible value of a categorical variable a "category". Although categories are usually non-numeric (as in the case of Sex and Embarked), they are sometimes numeric. For example, the variable Survived in the Titanic data set is a categorical variable with two categories (1 if the passenger survived, 0 if they didn't), even though those values are numbers. With a categorical variable, one common analysis question is, "How many observations are there in each category?"

Some variables do not fit neatly into either category. For example, the variable Name in the Titanic data set is obviously not quantitative, but it is not categorical either because it does not take on a limited set of values. Generally speaking, every passenger will have a different name (the two James Kellys notwithstanding), so it does not make sense to analyze the frequencies of different names, as one might do with a categorical variable. We will group variables like Name, which are neither quantitative nor categorical, into an "other" category.
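One way to answer the "how many observations in each category" question is the .value_counts() method of a Series. The sketch below applies it to the Sex and Survived values of the five passengers shown earlier in this chapter; with the full data set you would call it on the corresponding column instead.

```python
import pandas as pd

# Values of the first five passengers shown in this chapter.
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

# value_counts() returns one count per category, most frequent first.
sex_counts = sex.value_counts()
survived_counts = survived.value_counts()

print(sex_counts)
print(survived_counts)
```

The same method works whether the categories are strings (Sex) or numbers (Survived).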

Every variable can be classified into one of these three types: quantitative, categorical, or other. The type of the variable often dictates the kind of analysis we do and the kind of visualizations we make, as we will see later in this chapter.

pandas tries to infer the type of each variable automatically. If every value in a column (except for missing values) can be cast to a number, then pandas stores that column with a numeric dtype, which usually signals a quantitative variable. Otherwise, the column is stored with the generic object dtype, which covers both categorical and "other" variables. To see the type that pandas inferred for a variable, select that variable using any of the methods above and look for its dtype. A dtype of float64 or int64 indicates that pandas treats the variable as numeric. For example, the Age variable above has a dtype of float64, so it is quantitative. The Sex variable, on the other hand, has a dtype of object. Keep in mind that the inference looks only at the stored values: Survived is stored as integers even though it is categorical.
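Here is a short sketch for inspecting inferred types on a hypothetical five-row sample. It uses `is_numeric_dtype` from `pandas.api.types` to test for numeric storage, and `.astype("category")` as one possible way to mark a numerically coded variable such as Survived as categorical. Depending on the pandas version, a text column may be reported as object or as a dedicated string dtype, so the example avoids relying on the exact name.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Values of the first five passengers shown in this chapter.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# One inferred dtype per column.
print(sample.dtypes)

# True for columns stored as numbers, False for text columns.
for column in sample.columns:
    print(column, is_numeric_dtype(sample[column]))

# Survived is stored as integers but is conceptually categorical;
# converting it makes that explicit.
survived_cat = sample["Survived"].astype("category")
print(survived_cat.dtype)
print(list(survived_cat.cat.categories))
```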


To get a quick summary of a variable, we can use the .describe() method. Let's see what happens when we call .describe() on a quantitative variable, like Age.



In [39]:
titanic.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
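Note that the count above is 714 even though the Age column has 891 entries: .describe() counts only the non-missing values. The sketch below reproduces this on a small hypothetical Series with one missing value, and then shows what .describe() reports for a non-numeric variable.

```python
import pandas as pd

# Hypothetical ages with one missing value (None becomes NaN).
age = pd.Series([22.0, 38.0, None, 35.0], name="Age")
age_summary = age.describe()
print(len(age), age_summary["count"])  # total entries vs. non-missing entries

# For a text variable, describe() reports the number of values, the number
# of distinct values, the most frequent value, and that value's frequency.
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")
sex_summary = sex.describe()
print(sex_summary)
```

For a categorical variable the summary therefore answers a different question (which category is most common) than the mean and quartiles reported for a quantitative one.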

# Summary

* Tabular data is stored in a data structure called a DataFrame.
* Rows represent observations; columns represent variables.
* Single rows and columns are stored in a data structure called a Series.
* The row index should be a set of labels that uniquely identify observations.
* To select rows by label, we use .loc[]. To select rows by (0-based) position, we use .iloc[].
* To select columns, we can use .loc notation (specifying both the rows and columns we want, separated by a comma), key access, or attribute access.
* Variables can be quantitative, categorical, or other.
* Pandas will try to infer the type, and you can check the type that Pandas inferred by looking at the dtype.

# Homework

* Given the information provided here, add a section that explains how to find all the entries that satisfy a logical comparison. For example, how can you get the list of people who survived the sinking of the Titanic? You might need to consult other sources; if you do, add a citation for where you got your information.
* After completing the prior task, write one line of code that returns the summary of the Fare for all the passengers who survived the Titanic and another for all the passengers who didn't survive.
* Write a short paragraph talking about what you conclude from your findings.
