<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_01_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Data Science?
## Brendan Shea, PhD

**Data Science** is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It leverages a combination of techniques from statistics, mathematics, computer science, and domain-specific knowledge to achieve this goal. Essentially, it's about using data to generate meaningful and actionable insights that can inform decision-making in various sectors.

Let's break down these components:

1.  **Statistical Methods:** Data science relies heavily on statistics as a way to make sense of data. Statistics provide tools to describe and summarize data (descriptive statistics) and to make claims or predictions about the data (inferential statistics). For example, a transportation company may use statistics to analyze travel times, number of passengers, fuel efficiency, etc.

2.  **Programming:** This is the tool that allows data scientists to interact with data. By writing code, data scientists can collect, clean, analyze, and visualize data, as well as implement machine learning models. Python and R are popular languages in this field due to their readability and the wide array of data-focused libraries available.

3.  **Domain Expertise:** While the above skills provide the technical foundation for data science, domain expertise is crucial for interpreting results and making sound decisions. This involves understanding the sector in which the data analysis is being applied. For example, in the transportation sector, understanding factors like peak travel times, common routes, and regulatory policies can greatly inform a data science project.
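To make the statistics component concrete, here is a minimal sketch that summarizes a few bus travel times with Python's standard `statistics` module. The travel times are hypothetical numbers invented for illustration, not real transit data.

```python
import statistics

# Hypothetical travel times (in minutes) for one bus route; invented for illustration
travel_times = [32, 35, 31, 40, 38, 33, 36]

mean_time = statistics.mean(travel_times)      # arithmetic average
median_time = statistics.median(travel_times)  # middle value once the data is sorted
spread = statistics.stdev(travel_times)        # sample standard deviation

print("Mean travel time:", mean_time)
print("Median travel time:", median_time)
print("Standard deviation:", round(spread, 2))
```

The mean and median describe a "typical" trip, while the standard deviation describes how much individual trips vary around that typical value.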

For example, imagine a city government is trying to improve its public transportation system. A data scientist might be enlisted to help answer questions like:

-   "What are the peak hours for bus usage, and do we have sufficient buses to handle that load?"
-   "Are there neighborhoods that are underserved by our current bus routes?"
-   "How does weather impact the usage of public transportation?"

To answer these questions, the data scientist would collect data (perhaps from bus fare systems, GPS systems on buses, weather stations, etc.), clean and process the data, then use statistical methods to analyze it. They might use programming skills to automate the collection of data and to create visualizations of bus routes and usage patterns. Finally, they would use their understanding of the city and its transportation needs (the domain expertise) to interpret their findings and make recommendations.
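The collect, clean, analyze, and interpret steps described above might look roughly like the following sketch. The records are made up for illustration, and a real project would read them from the fare system rather than typing them in by hand.

```python
from collections import defaultdict

# Hypothetical fare-system records: (hour of day, riders counted).
# None marks a missing reading.
raw_records = [(7, 120), (8, 310), (8, None), (9, 150),
               (17, 280), (17, 295), (18, 190)]

# Clean: drop records with a missing rider count
clean_records = [(hour, riders) for hour, riders in raw_records
                 if riders is not None]

# Analyze: add up the riders for each hour
riders_by_hour = defaultdict(int)
for hour, riders in clean_records:
    riders_by_hour[hour] += riders

# Interpret: the peak hour is the one with the largest total
peak_hour = max(riders_by_hour, key=riders_by_hour.get)
print("Peak hour:", peak_hour, "with", riders_by_hour[peak_hour], "riders")
```

Deciding what to do about the peak (add buses, change routes, adjust schedules) is where domain expertise comes in.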


## Applications of Data Science
Data science has wide-ranging applications across virtually every domain. Let's take a look at a few more examples:

1.  **Healthcare:** In the healthcare sector, data science can be used for predictive modeling to identify disease trends and risk factors, optimize patient care paths, and even predict patient readmissions. For example, by analyzing patient records, clinical notes, and other health data, a data scientist can develop models to predict which patients are at higher risk of readmission after surgery. Such insights could then be used to improve post-operative care and reduce readmissions.

2.  **Finance:** Data science plays a vital role in financial institutions for fraud detection, risk management, investment modeling, and customer segmentation. For instance, data scientists can build models to identify unusual patterns of transactions that might indicate fraudulent activities.

3.  **Marketing and Sales:** In this domain, data science can be applied to forecast sales, analyze customer behavior, optimize pricing, and improve product recommendations. An e-commerce company might use data science to predict which products a customer is likely to buy based on their browsing history and previous purchases, thereby offering more personalized product recommendations.

4.  **Agriculture:** Data science has also found applications in farming, where it can help predict crop yields, optimize resource usage, and even identify plant diseases. For instance, by analyzing satellite images, weather data, and soil information, a data scientist could help farmers identify the optimal time to plant and harvest crops, potentially increasing productivity.

5.  **Sports:** In the sports industry, data science can be used for player performance analysis, game strategy, injury prediction, and fan engagement. A data scientist working with a basketball team, for example, could analyze player statistics to determine optimal lineups and strategies.

6.  **Education:** Data science can be used to predict student performance, reduce dropout rates, personalize learning, and even improve curriculum design. For example, by analyzing data on student attendance, assignment scores, and engagement in online learning platforms, a data scientist could identify students who are at risk of falling behind and need additional support.

In essence, data science is a versatile field with the potential to impact and transform virtually any domain. It's all about extracting insights from data to make informed decisions and predictions. And the beauty of it is that the fundamental techniques and approaches remain the same, whether you're analyzing bus routes, patient records, or basketball stats. It's the context and the specific questions you're asking that change.

## Why Python?
Python is a high-level, interpreted programming language known for its clear syntax and readability, which makes it a great choice for beginners in programming. "High-level" means that Python has a strong abstraction from the details of the computer system it runs on, making it easier to write and understand than lower-level languages. "Interpreted" means that you do not need a separate compile step to turn your program into a machine-language executable before running it; instead, the Python interpreter reads and executes your code when you run it.

Python is widely used in data science for several reasons:

1.  **Ease of Learning:** Python's syntax is designed to be readable and straightforward, which makes it an excellent language for beginners. Its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages like C++ or Java.

2.  **Versatile Libraries:** Python has a broad range of libraries that are specific to data science. Libraries like NumPy and Pandas are used for data manipulation and analysis, Matplotlib and Seaborn for data visualization, and scikit-learn for machine learning. These libraries provide pre-written code for many common tasks, so you don't have to write it from scratch.

3.  **Community Support:** Python has a strong community in the field of data science. This means that if you run into problems or need to learn new techniques, there are plentiful resources and forums available to help.

4.  **Integration:** Python can easily integrate with other languages and tools. This makes it flexible for different use cases in the data science workflow, from data gathering and cleaning to modeling and visualization.

Let's return to our earlier example (of the transportation department). Suppose a city's transportation department wants to analyze traffic data to optimize signal timings at intersections. They have data on vehicle counts, pedestrian counts, and timings of signal changes.

A data scientist could use Python to perform this analysis. They might use the Pandas library to load the traffic data into a DataFrame (a table-like data structure), and to clean and process the data (e.g., handling missing data, converting data types). They might then use Matplotlib or Seaborn to visualize the data, helping them understand patterns in traffic flow at different times of day. They might even use scikit-learn to build a machine learning model that predicts traffic flow based on time of day, day of the week, and weather conditions. Throughout this process, Python's clear syntax and powerful libraries make the data scientist's job easier.
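As a rough sketch of the cleaning steps mentioned above, the example below builds a tiny DataFrame by hand, converts a column's data type, and fills in a missing value. The column names and counts are hypothetical. Filling the gap with the column mean is only one possible choice, not necessarily what a transportation department would do.

```python
import pandas as pd

# Hypothetical intersection counts; every value here is invented for illustration
traffic = pd.DataFrame({
    "hour": ["7", "8", "9", "17", "18"],              # stored as text, as often happens in raw files
    "vehicles": [420.0, 610.0, None, 655.0, 480.0],   # None marks a missing sensor reading
    "pedestrians": [35, 80, 60, 95, 70],
})

# Convert data types: turn the text hours into integers
traffic["hour"] = traffic["hour"].astype(int)

# Handle missing data: fill the gap with the mean of the observed vehicle counts
traffic["vehicles"] = traffic["vehicles"].fillna(traffic["vehicles"].mean())

# Find the hour with the most vehicles
busiest = traffic.loc[traffic["vehicles"].idxmax(), "hour"]

print(traffic)
print("Busiest hour:", busiest)
```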

## Basic Python Syntax
In comparison to many other languages, Python has a straightforward syntax that beginners find easy to use, and that non-experts can use in their day-to-day lives. At the same time, it is (or can be) very powerful. Here, we'll explore some basic ideas.

### Comments

 In Python, you can add **comments** to your code that the interpreter will ignore during execution. These are used to explain what your code is doing, and they are especially useful when working with others or when you might revisit your code in the future. To create a comment, use the hash symbol (`#`) before your comment.

 Comments can be used to write **pseudocode**, or a plain-language description of what our code does (or what we hope it does!). For example:

In [None]:
# This is a comment - it doesn't do anything
print("Hello, World") # This line prints "Hello, World"

Hello, World
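Here is a slightly longer sketch of the pseudocode idea: the plan is written first as comments, and each line of code is then matched to a step. The passenger counts are hypothetical.

```python
# Pseudocode:
# 1. Record the passengers counted on the morning and evening trips
# 2. Add the two counts together
# 3. Print the total

morning_passengers = 42   # Step 1 (hypothetical count)
evening_passengers = 57   # Step 1 (hypothetical count)
total_passengers = morning_passengers + evening_passengers  # Step 2
print(total_passengers)   # Step 3
```

Writing the plan as comments first makes it easier to check that the finished code does what you intended.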


 ### Variables
 In Python, a **variable** is a named location used to store data in the memory. You create a variable by assigning a value to it with an equals sign (`=`). The variable can then be used elsewhere in your code to refer to that value. Python is dynamically-typed, which means you don't have to declare the data type of the variable when you create it, unlike some other languages. Common data types include:

| Data Type | Description | Example |
| ----------- | ----------- | ----------- |
| int | Integer | A whole number, such as 1, 100, or -12345. |
| float | Floating-point number | A number with a decimal point, such as 1.2345, -3.1415, or 0.0001. |
| str | String | A sequence of characters, such as "Hello, world!", "This is a string", or '12345'. |
| bool | Boolean | A value that can be either True or False. |
| list | List of values | A sequence of values that can be heterogeneous, meaning it can contain values of different data types. Examples of lists are [1, 2, 3], ["Hello", "world!"], and [1.2345, -3.1415]. |
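The short sketch below creates one value of each type from the table and uses the built-in `type()` function to confirm what Python has stored. The variable names and values are illustrative only.

```python
passenger_count = 42              # int
fare = 2.75                       # float
route_name = "Route 9"            # str
is_express = True                 # bool
stops = ["Main St", "5th Ave"]    # list

print(type(passenger_count))
print(type(fare))
print(type(route_name))
print(type(is_express))
print(type(stops))

# Dynamic typing: the same name can later refer to a value of a different type
passenger_count = "forty-two"
print(type(passenger_count))
```

Notice that no type was ever declared. Python works out the type from the value on the right-hand side of the `=`.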




Here's an example of declaring and using variables:

In [None]:
# Assigning values to variables
speed = 60  # This is an integer variable
time = 2.5  # This is a float variable
vehicle = 'Bus'  # This is a string variable

# Using the variables
distance = speed * time  # This calculates distance by multiplying speed and time
print(distance) # prints the result

150.0


In the above example, 'speed', 'time', and 'vehicle' are variables, which hold values 60, 2.5, and 'Bus', respectively. The 'distance' variable is calculated by multiplying the 'speed' and 'time' variables.

## Basic Math

Python uses basic arithmetic operators for performing mathematical operations, and these can be used with numeric variables or directly with numbers.

-   Addition (+): `3 + 4` will give `7`
-   Subtraction (-): `10 - 3` will give `7`
-   Multiplication (*): `4 * 5` will give `20`
-   Division (/): `10 / 2` will give `5.0`
-   Floor Division (//): `10 // 3` will give `3`. This operator divides and then rounds down, i.e., it returns the largest integer that is less than or equal to the exact quotient.
-   Modulus (%): `10 % 3` will give `1`. This operator returns the remainder of the division.
-   Exponentiation (\*\*): `2 ** 3` will give `8`. This operator raises the number to its left to the power of the number to its right.

For example:

In [None]:
# Variables
a = 10
b = 3

# Arithmetic operations
addition = a + b  # results in 13
subtraction = a - b  # results in 7
multiplication = a * b  # results in 30
division = a / b  # results in 3.3333333333333335
floor_division = a // b  # results in 3
modulus = a % b  # results in 1
exponentiation = a ** b  # results in 1000

# Print the results
print("Addition:", addition)
print("Subtraction:", subtraction)
print("Multiplication:", multiplication)
print("Division:", division)
print("Floor division:", floor_division)
print("Modulus:", modulus)
print("Exponentiation:", exponentiation)


Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.3333333333333335
Floor division: 3
Modulus: 1
Exponentiation: 1000
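Floor division and modulus are often used together. As a small sketch (the trip length is a made-up value), the code below splits a number of minutes into hours and leftover minutes, and then shows how floor division behaves with a negative number.

```python
# Hypothetical trip length in minutes
total_minutes = 135

hours = total_minutes // 60     # whole hours: 135 // 60 is 2
minutes = total_minutes % 60    # leftover minutes: 135 % 60 is 15
print(hours, "hours and", minutes, "minutes")

# Floor division rounds down (toward negative infinity), not toward zero
print(-7 // 2)   # -4, because -4 is the largest integer <= -3.5
print(-7 % 2)    # 1, because -7 == 2 * (-4) + 1
```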


## Calling a Function with Parameters

A **function** in Python is a reusable block of code that performs a specific task. You can use a function by "calling" it by its name, followed by parentheses. Some functions accept **parameters**, which are values that the function uses to perform its task.

A very basic example of a function is the `print()` function. The `print()` function outputs the value of its parameter(s) to the screen. For example:

In [None]:
print("Hello, World")

Hello, World


In this example, the string "Hello, World" is a parameter to the `print()` function. When you run this code, Python calls the `print()` function, passing in "Hello, World" as the parameter, and the `print()` function outputs "Hello, World" to the screen.

Functions can take multiple parameters. For instance, the `print()` function can take multiple parameters and will print all of them, separated by spaces. For example:

In [None]:
print("Hello","World")

Hello World


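This code will output `Hello World`. Notice that `print()` inserted a space between the two values; the comma in the code separates the parameters and is not printed.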
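`print()` also accepts the optional keyword parameters `sep` and `end`, which control what is placed between values and after the last value. The sketch below uses made-up example values to show both.

```python
# Default: values are separated by a single space
print("Route", 9, "arrives in", 4.5, "minutes")

# sep changes the separator placed between values
print("2024", "01", "15", sep="-")

# end changes what is printed after the last value (a newline by default)
print("Loading", end="...")
print("done")
```

The last two calls print on the same line, because the first of them ends with `...` instead of a newline.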

### Basic Python Functions Useful in Data Science

Python includes many built-in functions that can be particularly useful in data science. Let's take a look at a few examples:

| Function | Description |
| --- | --- |
| `len(x)` | Returns the length (the number of items) of an object `x`. This is often used when you want to know how many items are in a list, string, dictionary, or other Python collection. |
| `sum(x)` | Returns the sum of all items in an iterable `x` (like a list or a tuple). Useful when you need to add up all the numbers in a collection. |
| `min(x)` | Returns the smallest item in an iterable `x`. Handy for finding the smallest number in a list, the earliest date in a timeline, or the shortest string in a collection, etc. |
| `max(x)` | Returns the largest item in an iterable `x`. Opposite of `min(x)`, it's useful for finding the largest number, the latest date, the longest string, etc. |
| `type(x)` | Returns the type of an object `x`. This can be helpful for understanding the data you're working with, especially when you're dealing with different data types (like integers, floats, strings, lists, etc.). |
| `round(x, n)` | Returns a floating point number that is a rounded version of the specified number `x`, with `n` number of decimals. This is useful when you need to display a number with a specific number of decimal places. |

Here are some examples:

In [None]:
# Define a list of numbers
numbers = [4.75, 3.12, 7.64, 5.29, 6.99, 1.75, 8.64, 3.33]

# Print the length of the list
print("Length of the list:", len(numbers))

# Print the sum of the numbers
print("Sum of the numbers:", sum(numbers))

# Print the minimum number
print("Minimum number:", min(numbers))

# Print the maximum number
print("Maximum number:", max(numbers))

# Print the type of the list
print("Type of the list:", type(numbers))

# Print a rounded version of the first number
print("Rounded first number:", round(numbers[0], 2))


Length of the list: 8
Sum of the numbers: 41.51
Minimum number: 1.75
Maximum number: 8.64
Type of the list: <class 'list'>
Rounded first number: 4.75
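These built-in functions can also be combined. As a minimal sketch using the same list, the code below computes the mean (the sum divided by the count) and the range (the largest value minus the smallest).

```python
numbers = [4.75, 3.12, 7.64, 5.29, 6.99, 1.75, 8.64, 3.33]

mean = sum(numbers) / len(numbers)          # average = total divided by count
value_range = max(numbers) - min(numbers)   # spread between largest and smallest

print("Mean:", round(mean, 2))
print("Range:", round(value_range, 2))
```

Here `round()` keeps the printed results to two decimal places, which hides the tiny errors that floating-point arithmetic can introduce.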


## Jupyter Notebooks and Google Colab

This textbook is provided as a set of **Jupyter Notebooks** written using **Google Colab.** (If you are reading a hard copy or PDF, the link should be in the introduction). Jupyter notebooks and Google Colab are interactive environments that allow you to write and execute Python code in "cells", making the coding process more flexible and controllable. They have become very popular tools in data science due to their user-friendly nature, as well as their unique blend of features.

A Jupyter notebook is a web-based interface that allows for the creation of documents that can contain live code, equations, visualizations, and explanatory text. This combination of elements makes them excellent for creating and sharing data science projects, as they allow for code, visual output, and explanatory text to be bundled together.

Google Colab, short for Colaboratory, is a free Jupyter notebook environment that runs entirely in the cloud. It allows you to write and execute Python code right in your browser, with no setup required and easy access to resources like GPUs.

These are widely used tools in data science, for a number of reasons:

1.  Interactive Programming: Both Jupyter notebooks and Google Colab offer an interactive programming environment where you can write a piece of code and run it to see its output immediately. This makes it easier to experiment, debug, and make incremental progress on your work.

2.  Rich Output: Jupyter notebooks and Google Colab support rich output, meaning your code can produce more than just text. Outputs like charts, images, tables, or even interactive widgets can be created and rendered right in your notebook. This is particularly useful in data science, where visualizing data is often key to understanding it.

3.  Documentation and Sharing: With Jupyter notebooks and Google Colab, you can intersperse code with text explanations, mathematical equations, and markdown formatting to document your work. This makes your work more understandable for others (and for yourself, when you look back on your work later!). Additionally, these notebooks can be easily shared, published, and even turned into slideshows or web pages, making them a great tool for collaborative work and reporting.

4.  Accessibility and Compatibility: Jupyter notebooks can be run on any major operating system (Windows, macOS, Linux). Google Colab is even more accessible -- as a web application, it can be accessed from any device with an internet connection and a web browser. This also means you don't need a powerful computer to do intensive computations, as you can utilize Google's hardware.

5.  Integration with Data Science Tools: Both Jupyter notebooks and Google Colab are compatible with many popular data science libraries, such as NumPy, Pandas, Matplotlib, and scikit-learn, allowing you to use these powerful tools right within your notebooks.

In summary, Jupyter notebooks and Google Colab provide an interactive, versatile, and accessible platform for performing and sharing data science work, which makes them invaluable tools in this field.

## Basic Operations in Google Colab

Google Colab is a very user-friendly tool and doing basic operations is quite straightforward. Let's go through some of these basic operations.

1.  Adding a Text Cell: To add a text cell, you can either click on `+ Text` in the top left corner of the screen or use the shortcut `Ctrl + M` then `T`. Text cells can contain notes, formatted text via markdown, mathematical equations written in LaTeX, HTML code, and even images.

2.  Adding a Code Cell: To add a code cell, you can either click on `+ Code` in the top left corner or use the shortcut `Ctrl + M` then `C`. Code cells are used to write and run Python code.

3.  Editing a Cell: To edit a cell, simply click on the cell and start typing. For text cells, clicking on the cell will open the markdown editor. For code cells, clicking on the cell will allow you to start typing code.

4.  Running a Code Cell: To run a code cell, you can click on the play button on the left of the cell or use the shortcut `Ctrl + Enter`. If you want to run the cell and then automatically select the next cell, you can use `Shift + Enter` instead.

5.  Saving a Notebook: Google Colab automatically saves your notebook to Google Drive as you work, but if you want to manually save, you can use the shortcut `Ctrl + S` or click on `File` in the menu, then `Save`.

6.  Downloading a Notebook: To download your notebook, you can go to `File` in the menu, then `Download .ipynb` or `Download .py`. The `.ipynb` format is a Jupyter Notebook, and the `.py` format is a Python script.

## Exercises

Now, let's practice everything you've learned so far with some exercises:

1.  Create a new Google Colab notebook and name it "Practice Notebook".

2.  Add a text cell at the top of your notebook, and write a brief introduction about yourself in it. Make sure to include a title for your notebook using markdown (Hint: use `#` for a title).

3.  Add a code cell and write a simple Python program that prints "Hello, world!". Run the cell and make sure it prints out correctly.

4.  Add a text cell under your program, and write a brief explanation of what your program does.

5.  Create a Python list of 5 numbers of your choice. Use the `len()`, `sum()`, `min()`, `max()`, `type()`, and `round()` functions on your list and print out the results.

6.  Write explanations of what each function does with your list in a new text cell.

7.  Make sure to save your notebook when you're finished!

This exercise should give you some practice using Google Colab and writing Python code. Remember, the best way to learn is by doing, so don't hesitate to play around and experiment!

## Getting to Know the mtcars Dataset
OK, let's start exploring some data. We'll begin with the `mtcars` dataset, a treasure trove of information about 32 different car models, specifically the 1973--74 models featured in the 1974 issue of Motor Trend US magazine. It's a rich dataset filled with various attributes of these cars, including:

1.  `mpg`: How many miles can the car travel per US gallon? More is generally better!
2.  `cyl`: The number of cylinders in the car's engine. More cylinders usually mean more power!
3.  `disp`: The car's displacement, or the total volume of all the cylinders in its engine, measured in cubic inches.
4.  `hp`: The gross horsepower of the car, a measure of how powerful the car's engine is.
5.  `drat`: The rear axle ratio, which affects the car's performance and fuel efficiency.
6.  `wt`: How much does the car weigh (in 1000 lbs units)?
7.  `qsec`: The time (in seconds) the car takes to cover a quarter of a mile. Faster is usually more fun!
8.  `vs`: A binary variable indicating the engine shape: V-shaped (0) or straight (1).
9.  `am`: Is the car automatic (0) or manual (1)?
10. `gear`: How many forward gears does the car have?
11. `carb`: How many carburetors does the car have?
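Because several of these columns use units that may be unfamiliar, here is a short sketch that converts three values for one car in the dataset (the Mazda RX4) into other units. The variable names are illustrative; the conversion factors are the standard definitions of the cubic inch, mile, and US gallon.

```python
# Values for the Mazda RX4, one of the 32 cars in mtcars
mpg = 21.0     # miles per US gallon
wt = 2.62      # weight in units of 1000 lbs
disp = 160.0   # displacement in cubic inches

weight_lbs = wt * 1000                        # convert to pounds
disp_liters = disp * 0.016387064              # 1 cubic inch = 0.016387064 liters
km_per_liter = mpg * 1.609344 / 3.785411784   # 1 mile = 1.609344 km; 1 US gallon = 3.785411784 L

print("Weight (lbs):", round(weight_lbs))
print("Displacement (L):", round(disp_liters, 2))
print("Fuel economy (km/L):", round(km_per_liter, 2))
```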

Why are we using the mtcars dataset?  There are several reasons:

1.  Manageable Size: The 'mtcars' dataset isn't too big or too small. It's just the right size for you to get your hands dirty without feeling overwhelmed.

2.  Variety of Data Types: This dataset is like a mini buffet of data types. You'll get to work with continuous, discrete, and categorical variables - a perfect playground to start practicing!

3.  Real-world Data: While it's a bit old (cars have changed quite a bit since the 1970s), this data is from real cars. So, you'll be applying your new data science skills to real-world examples, making your learning experience more relatable and practical.

4.  Exploration Opportunities: The 'mtcars' dataset has enough depth to keep things interesting. You can try out different analysis techniques, ask various questions, and discover meaningful insights.

Learning with the 'mtcars' dataset is like learning to drive in a safe, controlled environment before hitting the highway. It's designed to give you a solid foundation and make your first steps in data science as enjoyable and educational as possible! So, buckle up and let's dive in!

## Loading the mtcars Dataset
**Data loading** is the first, and one of the most crucial, steps in your data science journey. Here, we fetch the data from a specific source and load it into our Python environment.

So, let's go ahead and load our first dataset, the 'mtcars' dataset!

In [1]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd # More on this below

dataset = data('mtcars') # Load the mtcars dataset

dataset.head() # display first five rows


initiated datasets repo at: /root/.pydataset/


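| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |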


What's happening in this code?  Let's break it down:

1.  `!pip install pydataset -q`: This line uses pip, Python's package manager, to install the 'pydataset' package. The 'pydataset' package contains a variety of datasets, including 'mtcars'. The `-q` flag stands for 'quiet', which means pip won't display all the download details. This keeps our notebook clean and tidy.

2.  `from pydataset import data` & `import pandas as pd`: These lines import the necessary modules for our task. The 'data' function from the 'pydataset' package allows us to load datasets. 'pandas' is a powerful library for data manipulation and analysis in Python, and we import it with the alias 'pd' for convenience.

3.  `dataset = data('mtcars')`: Here, we're calling the 'data' function and passing 'mtcars' as the argument. This function fetches the 'mtcars' dataset and loads it into our Python environment. We assign this dataset to the variable 'dataset' so we can work with it in the future.

4.  `dataset.head()`: Finally, we call the DataFrame's `head()` method to display the first five rows of our dataset. This is like peeking into the dataset, giving us a glimpse of what we're working with.
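A few other DataFrame features are handy for a first look at data. The sketch below does not load the real dataset; it builds a three-row stand-in by hand, using values from the rows displayed above, so that it can run without installing `pydataset`. The variable name `cars` is just an illustrative choice.

```python
import pandas as pd

# A three-row stand-in for mtcars, built by hand from the rows shown above
cars = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8],
        "cyl": [6, 6, 4],
        "hp": [110, 110, 93],
    },
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
)

print(cars.head(2))         # head(n) shows the first n rows instead of the default five
print(cars.shape)           # (number of rows, number of columns)
print(list(cars.columns))   # the column names
print(cars["mpg"].mean())   # average of a single column
```

The same calls should work on the full `dataset` variable loaded earlier, since it is also a pandas DataFrame.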

You've successfully loaded your first dataset! This process of loading and inspecting data is fundamental in any data science project. It sets the stage for all the exciting analyses to come. Now, get ready to dive deeper into data exploration in the upcoming sections!