# Introduction to Data Science in Python

These are my notes for DataCamp's course [_Introduction to Data Science in Python_](https://www.datacamp.com/courses/introduction-to-data-science-in-python).

This course is presented by Hillary Green-Lerman, Lead Data Scientist at Looker. The collaborator is Mona Khalil.

There are no prerequisites for this course.

This course is part of these tracks:

- Data Skills for Business

There are no data sets provided for this course. However, I was able to replicate the needed data in CSV files.

The course does an excellent job of pointing out the errors an inexperienced user is likely to encounter and how to fix them.

## Versions

For this notebook, I used:

- Python 3.12.7
- matplotlib 3.9.2
- numpy 2.1.3
- pandas 2.2.3

## Imports

Imports are placed here for convenience and clarity. 

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Getting Started in Python

### Dive into Python

This course is intended for people with no experience with Python. It is the most introductory course that I've encountered so far. This video explains how to use the DataCamp interface and how to import modules.

#### Exercises

In [None]:
# Importing modules.
# See Imports above.

### Creating Variables

This video explains the rules for creating variables.
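As a rough reminder of those rules (stated here from standard Python behavior, not from the video's exact wording): a name may contain letters, digits, and underscores, may not start with a digit, may not be a reserved keyword, and is case-sensitive. The sketch below checks a few candidate names that I made up for illustration.

```python
import keyword

# Hypothetical candidate names, chosen only for illustration.
candidates = ["bayes_age", "favorite_toy", "2nd_owner", "case-id", "class"]

for name in candidates:
    # A usable variable name is a valid identifier that is not a keyword.
    is_valid = name.isidentifier() and not keyword.iskeyword(name)
    print(f"{name!r}: {'valid' if is_valid else 'invalid'}")

# Names are case-sensitive, so these are two different variables.
owner = "DataCamp"
Owner = "Bayes"
print(owner != Owner)
```

The first two names pass; the others fail because of a leading digit, a hyphen, and a reserved keyword.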

#### Exercises

In [None]:
# Store each piece of information in a variable.
bayes_age = 4.0
print(bayes_age)
favorite_toy = "Mr. Squeaky"
owner = "DataCamp"
print(favorite_toy)
print(owner)

In [None]:
birthday = "2017-07-14"
case_id = "DATACAMP!123-456?"

### Fun with Functions

The video provides a detailed example of how to create a function, including positional and keyword arguments.
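To keep the idea handy, here is a minimal sketch of a function that takes one positional argument and two keyword arguments. The function `describe_pet` and its parameters are hypothetical; they are not from the course.

```python
def describe_pet(name, species="dog", age=None):
    """Return a one-line description of a pet (illustrative helper)."""
    description = f"{name} is a {species}"
    if age is not None:
        description += f", age {age}"
    return description


# Positional argument only; the keyword arguments keep their defaults.
print(describe_pet("Bayes"))

# Keyword arguments can be given in any order after the positional one.
print(describe_pet("Bayes", age=4, species="golden retriever"))
```

Positional arguments are matched by order, while keyword arguments are matched by name, which is why the second call can list `age` before `species`.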

Gertrude Mary Cox was an American statistician and founder of the department of Experimental Statistics at North Carolina State University. 

Kirstine Smith was a Danish statistician. She is credited with the creation of the field of optimal design of experiments.

#### Exercises

In [None]:
# Extra credit.
# In the console, I entered the command !cat ransom.csv to print the contents
# of the ransom.csv file in the console. I copied the data into a file named
# ransom.csv and saved the file in my project folder.

# Load the data and view it.
r = pd.read_csv("ransom.csv")
r

In [None]:
# Create a line plot.
plt.plot(r["letter"], r["frequency"])
plt.show()

In [None]:
# Extra credit.
# The course provided a function, lookup_plate.
# I used inspect.getsource() to get the source code for this function.
# import inspect
# src = inspect.getsource(lookup_plate)
# print(src)
# This is the function.
def lookup_plate(plate_str, color=None):
    if type(plate_str) != str:
        print("Error! Please input a string!")
        return False
    elif len(plate_str) != 7:
        print("Error! License plate must have 7 characters. Use a * for missing characters.")
        return False
    elif plate_str == 'FRQ****':
        if color is None:
            print('''
            Fred Frequentist
            John W. Tukey
            Ronald Aylmer Fisher
            Karl Pearson
            Gertrude Cox
            Kirstine Smith
            ''')
            return True
        elif color == 'Green':
            print('''
            Fred Frequentist
            Ronald Aylmer Fisher
            Gertrude Cox
            Kirstine Smith
            ''')
            return True
        else:
            print('Error! No cars of that color found!')
            return False
    elif plate_str == 'EXAMPLE':
        print('''
        Christopher Eccleston
        Matt Smith
        David Tenant
        Peter Capaldi
        Jodie Whittaker
        ''')
        return True
    else:
        print("Error! Plate not found!")
        return False

# Use the function.
plate = "FRQ****"
lookup_plate(plate)

In [None]:
# Call the function with a second argument.
# We have a partial plate number and the plate's color.
lookup_plate(plate, "Green")

## Loading Data in pandas

### What is pandas?

pandas is a tool for working with tabular data. pandas can:
- Load tabular data from different sources
- Search for particular rows or columns
- Calculate aggregate statistics
- Combine data from multiple sources

pandas provides a new data type, the DataFrame.

One of the easiest ways to import data is from a CSV file.

Use `pandas.DataFrame.head` to look at the first few lines of a DataFrame. Use `pandas.DataFrame.info` to get information about the structure of the DataFrame.
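Here is a small self-contained sketch of loading CSV data and inspecting it. The CSV text and the name `demo` are made up for illustration, and `io.StringIO` stands in for a file on disk so the example runs without any course data.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """item,price
treats,9.50
leash,14.00
toy,6.25
collar,11.00
bed,40.00
bowl,7.75
"""
demo = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; pass n to change that.
print(demo.head())
print(demo.head(2))

# info() prints column names, non-null counts, and dtypes itself,
# and returns None.
demo.info()
```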

#### Exercises

In [None]:
# Extra credit.
# In the console, I entered the command !cat credit_records.csv.
# I copied the data into a new file, credit_records.csv, and saved the file in
# the project directory.

# Load the data from credit_records.csv.
credit_records = pd.read_csv("credit_records.csv")
print(credit_records.head())
print()
# info() prints its summary directly and returns None, so print() isn't needed.
credit_records.info()

### Selecting Columns

In [None]:
# Select columns in different ways.
# We can use dot notation as long as the column name doesn't contain a
# space character.
price = credit_records["price"]
print(price)
print()
# Get the column data using dot notation.
price2 = credit_records.price
print(price2)
# Show that the two objects are equivalent.
price.equals(price2)

#### Exercises

In [None]:
# Select data from the item column two different ways.
items = credit_records["item"]
print(items)
items2 = credit_records.item
items.equals(items2)

In [None]:
location = credit_records["location"]
print(location)
print()
location

In [None]:
# Extra credit.
# Create the CSV file missing_puppy_reports.csv for the mpr DataFrame.
# I used this command to format the data to copy and paste into the CSV file:
#    print(mpr.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
# When reading the file, one value was "", and pd.read_csv turned this into
# NaN. To match the data in the course, I had to use keep_default_na=False.
mpr = pd.read_csv("missing_puppy_reports.csv", keep_default_na=False)

# Print information about the mpr DataFrame.
mpr.info()
print()

# Subset the data.
name = mpr["Dog Name"]
is_missing = mpr["Missing?"]
print(name)
print()
print(is_missing)

### Selecting Rows with Logic

The video describes how to use comparison operators such as `==` and `<` in Python. Each comparison evaluates to `True` or `False`.
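The same operators work on a whole DataFrame column at once, which is what makes row selection possible. A minimal sketch, using a made-up `dogs` DataFrame rather than the course data:

```python
import pandas as pd

# Made-up data, only for illustration.
dogs = pd.DataFrame({"name": ["Bayes", "Sigmoid", "Sparky"], "age": [4, 1, 7]})

# Comparing a column to a value gives one True/False per row.
is_older = dogs["age"] > 2
print(is_older)

# Passing that Boolean Series inside [] keeps only the True rows.
print(dogs[is_older])
```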

#### Exercises

In [None]:
# Experimenting with logic operators.
height_inches = 65
print(height_inches > 70)
plate1 = "FRQ123"
print(plate1 == "FRQ123")
fur_color = "blonde"
print(fur_color != "brown")

In [None]:
# Extra credit.
# The mpr DataFrame changed for this exercise.
# Extract the data and put it into a CSV file.
# I used this command to format the data to copy and paste into the CSV file:
#    print(mpr.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
# Read the data.
mpr2 = pd.read_csv("missing_puppy_reports2.csv", keep_default_na=False)
mpr2

In [None]:
# Select data using logic queries. (I'm speaking SQL here.)
greater_than_2 = mpr2[mpr2["Age"] > 2]
print(greater_than_2)
print()

still_missing = mpr2[mpr2["Status"] == "Still Missing"]
print(still_missing)
print()

not_poodle = mpr2[mpr2["Dog Breed"] != "Poodle"]
print(not_poodle)

In [None]:
# This exercise uses a different credit_records DataFrame (boo!).
# I had to build a new CSV file, credit_records2.csv.
# I used this command in the console to obtain the data:
# print(credit_records.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
# I saved the data to credit_records2.csv in the project folder.

# Read the data from the CSV file.
credit_records2 = pd.read_csv("credit_records2.csv")
credit_records2

In [None]:
# Identify suspects who made purchases at Pet Paradise.
purchase = credit_records2[credit_records2.location == 'Pet Paradise']
purchase

## Plotting Data with Matplotlib

### Creating Line Plots

In [None]:
# Create a line plot.
plt.plot(r["letter"], r["frequency"])
plt.show()

#### Exercises

In [None]:
# Extra credit.
# Load the deshaun data into a DataFrame.
# I printed the data in the console using:
#    print(deshaun.to_csv(index=False))
# and saved it into the file deshaun.csv.
# I did the same for aditya.csv and mengfei.csv.

# Read the data.
deshaun = pd.read_csv("deshaun.csv")
aditya = pd.read_csv("aditya.csv")
mengfei = pd.read_csv("mengfei.csv")
deshaun

In [None]:
# Plot hours worked vs. day of week for the three officers.
# We don't have labels or a legend yet.
plt.plot(deshaun["day_of_week"], deshaun["hours_worked"])
plt.plot(aditya["day_of_week"], aditya["hours_worked"])
plt.plot(mengfei["day_of_week"], mengfei["hours_worked"])
plt.show()

### Adding Text to Plots

In [None]:
# Create a line plot. Add axis labels and a title.
plt.plot(r["letter"], r["frequency"])
plt.xlabel("Letter")
plt.ylabel("Frequency")
plt.title("Ransom Note Letters")
plt.show()

In [None]:
# Create a line plot with labels and a legend.
plt.plot(aditya["day_of_week"], aditya["hours_worked"], label="Aditya")
plt.plot(deshaun["day_of_week"], deshaun["hours_worked"], label="Deshaun")
plt.plot(mengfei["day_of_week"], mengfei["hours_worked"], label="Mengfei")
plt.legend()
plt.show()

In [None]:
# Add arbitrary text at a specific location.
plt.plot(r["letter"], r["frequency"])
plt.xlabel("Letter")
plt.ylabel("Frequency")
plt.title("Ransom Note Letters")
plt.text(5, 9, "Unusually low H frequency!")
plt.show()

Change the font size of text by adding a `fontsize=n` argument.

Change the color of text by adding a `color="colorname"` argument. Matplotlib uses web colors: https://en.wikipedia.org/wiki/Web_colors
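A short sketch of both arguments together. The plotted points and the annotation position are made up; only `fontsize` and `color` are the point of the example.

```python
import matplotlib.pyplot as plt

# Made-up points, only to give the annotation something to sit on.
plt.plot([1, 2, 3, 4], [10, 2, 9, 11])

# fontsize is given in points; color accepts a named color such as "tomato".
note = plt.text(2, 3, "Unusually low value!", fontsize=14, color="tomato")
plt.show()
```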

#### Exercises

In [None]:
# Add labels and a legend to the work effort plot.
# I used the title and Y label used by the tutorial.
plt.plot(deshaun["day_of_week"], deshaun["hours_worked"], label="Deshaun")
plt.plot(aditya["day_of_week"], aditya["hours_worked"], label="Aditya")
plt.plot(mengfei["day_of_week"], mengfei["hours_worked"], label="Mengfei")
plt.legend()
plt.title("Hours worked per day, by officer, on the Missing Puppy Report for Bayes")
plt.ylabel("Hours worked per day")
plt.show()

In [None]:
# Extra credit.
# I created the six_months.csv file from the output of the console command:
#    print(six_months.to_csv(index=False))
# It looks like the data for May is also missing; there is no column for May.

# Load the six_months.csv data file.
# Plot the data and add floating text.
# Just for practice, I added labels and a title, and I adjusted the
# range of the y axis.
six_months = pd.read_csv("six_months.csv")
plt.plot(six_months.month, six_months.hours_worked)
plt.ylabel("Hours worked")
plt.xlabel("Month")
plt.yticks(range(0, 210, 20))
plt.text(2.5, 80, "Missing June data")
plt.title("Hours worked")
plt.show()

### Styling Graphs

Change the color of a line using the `color` argument. Change the line's width using the `linewidth` argument. Change the style of a line using the `linestyle` argument.

In [None]:
# Demonstrate using these. This does not exactly replicate the plot in
# the video.
x = np.arange(0, 11)
y1 = x * 1
y2 = x * 2
y3 = x * 3
y4 = x * 4
plt.plot(x, y1, color="tomato", linewidth=1, linestyle="-", marker="x")
plt.plot(x, y2, color="orange", linewidth=2, linestyle="--", marker="s")
plt.plot(x, y3, color="goldenrod", linewidth=3, linestyle="-.", marker="o")
plt.plot(x, y4, color="seagreen", linewidth=4, linestyle=":", marker="d")
plt.show()

There are several different plot styles that can be used.

```
plt.style.use("fivethirtyeight")
plt.style.use("ggplot")
plt.style.use("seaborn-v0_8")  # named "seaborn" before matplotlib 3.6
plt.style.use("default")
```
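`plt.style.use` changes the style for every plot that follows. As an alternative that is not covered in the course, matplotlib also provides `plt.style.context`, which applies a style only inside a `with` block. This is a minimal sketch; the variable names are mine.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 40)
face_before = plt.rcParams["axes.facecolor"]

# The style applies only to plots created inside the with block.
with plt.style.context("ggplot"):
    face_inside = plt.rcParams["axes.facecolor"]
    plt.plot(x, np.sin(x))
    plt.title("ggplot, applied temporarily")
    plt.show()

# After the block, the earlier settings are restored.
face_after = plt.rcParams["axes.facecolor"]
print(face_before, face_inside, face_after)
```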

In [None]:
# Extra credit.
# Plot a sine wave using every available style.
# I couldn't find a way to change the styles in subplots.
# This code uses what I learned in "Introduction to Data Visualization with Matplotlib".

# List available styles.
print(plt.style.available)

# Create values.
x = np.linspace(0, 2 * math.pi, 40)
y = np.sin(x)

# Provide examples of all plotting styles.
for style_name in plt.style.available:
    plt.style.use(style_name)
    plt.plot(x, y)
    plt.title(style_name)
    plt.show()


#### Exercises

In [None]:
# Extra credit.
# Build the data DataFrame by reading the data from crime_data.csv.
# I obtained the data from the console using this command:
#    print(data.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
# I copied the data into crime_data.csv and saved the file in the project folder.
data = pd.read_csv("crime_data.csv")
data

In [None]:
# Plot the data, making various modifications.
plt.style.use("default")
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix", color="DarkCyan")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles", linestyle=":")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia", marker="s")
plt.xlabel("Year")
plt.ylabel("Burglaries per 100 residents")
plt.legend()
plt.show()

In [None]:
# Set a global style and plot the data.
# My fivethirtyeight looks different from DataCamp's.
# I think the DataCamp UI is not changing the plot style.
plt.style.use("fivethirtyeight")
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")
plt.legend()
plt.show()

In [None]:
# Use the ggplot style.
plt.style.use("ggplot")
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")
plt.legend()
plt.show()

In [None]:
# Use the Solarize_Light2 style.
plt.style.use("Solarize_Light2")
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")
plt.legend()
plt.show()

In [None]:
# Extra credit.
# Create CSV files as before for the DataFrames suspect1 and suspect2.
#    print(suspect1.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
#    print(suspect2.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))
# Each of these files contains an unneeded column that can be ignored.

# Plot the letter frequencies from the ransom note.
# The tutorial concludes that Fred Frequentist is the dognapper based
# on the similar frequencies of H and P.
suspect1 = pd.read_csv("suspect1.csv")
suspect2 = pd.read_csv("suspect2.csv")
ransom = r
plt.style.use("default")
plt.plot(ransom.letter, ransom.frequency,
         label="Ransom",
         linestyle=':',
         color='gray')
plt.plot(suspect1.letter, suspect1.frequency, label="Fred Frequentist")
plt.plot(suspect2.letter, suspect2.frequency, label="Gertrude Cox")
plt.xlabel("Letter")
plt.ylabel("Frequency")
plt.legend()
plt.show()

## Different Types of Plots

### Making a Scatter Plot

Make a scatter plot using `plt.scatter()`. Use the `alpha` argument to set the transparency of points plotted in the scatter plot.
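A minimal sketch of the effect of `alpha`, using randomly generated points rather than the course data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points, generated only for illustration.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=0, scale=1, size=500)
y = rng.normal(loc=0, scale=1, size=500)

# alpha runs from 0 (invisible) to 1 (opaque); with a low value,
# overlapping points show up as darker regions.
points = plt.scatter(x, y, alpha=0.2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```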

#### Exercises

In [None]:
# Extra credit.
# Copy the data from the cellphone DataFrame into the file cellphone.csv.
# I used the following command to print the data in the console. I copied the
# text from the console and pasted it into cellphone.csv. I modified the header
# to remove the "Unnamed:0" column name.
#    print(cellphone.to_csv(index=False))
cellphone = pd.read_csv("cellphone.csv", index_col=0)
# head() is not the last expression in the cell, so print it explicitly.
print(cellphone.head())

# Plot the cellphone coordinate data.
# The tutorial plots these points over an image that represents a map of the
# town. I don't know how to copy the image so I can use it.
plt.scatter(cellphone["x"], cellphone["y"], marker="s", color="red", alpha=0.1)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()

### Making a Bar Chart

In [None]:
# Extra credit.
# Create the bar chart data.
data_dict = {
    "precinct": ["Farmburg", "Cityville", "Suburbia"],
    "pets_abducted": [10, 15, 9],
    "error": [2, 3, 2],
}
df = pd.DataFrame(data_dict)
df

In [None]:
# Create a vertical bar chart with error bars.
plt.bar(df["precinct"], df["pets_abducted"], yerr=df["error"])
plt.ylabel("Pet abductions")
plt.show()

In [None]:
# Create a horizontal bar chart. Note changes in some arguments.
plt.barh(df["precinct"], df["pets_abducted"], xerr=df["error"])
plt.xlabel("Pet abductions")
plt.show()

In [None]:
# Extra credit.
# I created a DataFrame by reading the values from the figure in the video.
data_dict2 = {
    "precinct": ["Farmburg", "Cityville", "Suburbia"],
    "dog": [4, 10, 3],
    "cat": [6, 5, 6],
}
df2 = pd.DataFrame(data_dict2)
df2

In [None]:
# Create a stacked bar chart.
plt.bar(df2["precinct"], df2["dog"], label="Dog")
plt.bar(df2["precinct"], df2["cat"], bottom=df2["dog"], label="Cat")
plt.ylabel("Pet abductions")
plt.legend()
plt.show()

#### Exercises

In [None]:
# Recreate the small data set.
column_names = ("officer", "desk_work", "field_work", "avg_hours_worked", "std_hours_worked")
hours_data = (
    ("Deshaun", 25, 20, 45, 3),
    ("Mengfei", 20, 13, 33, 9),
    ("Aditya", 12, 30, 42, 5)
)
hours = pd.DataFrame(hours_data, columns=column_names)
hours

In [None]:
# Create a bar chart.
plt.bar(hours["officer"], hours["avg_hours_worked"], yerr=hours["std_hours_worked"])
plt.show()

In [None]:
# Create a stacked bar chart.
plt.bar(hours["officer"], hours["desk_work"], label="Desk work")
plt.bar(hours["officer"], hours["field_work"], bottom=hours["desk_work"], label="Field work")
plt.legend()
plt.show()

### Making a Histogram

Given a data set in the DataFrame `gravel`, plot a histogram:
```
plt.hist(gravel["mass"], bins=40, range=(50, 100))
plt.show()
```

To normalize the data, use the argument `density=True`.
```
plt.hist(male_weight, density=True, alpha=0.3)
plt.hist(female_weight, density=True, alpha=0.3)
plt.show()
```
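To see what `density=True` does, this sketch compares two synthetic samples of very different sizes. The sample names and values are made up; they are not the weight data from the video.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples of different sizes, generated only for illustration.
rng = np.random.default_rng(seed=1)
small_sample = rng.normal(loc=60, scale=5, size=200)
large_sample = rng.normal(loc=75, scale=5, size=2000)

# plt.hist returns the bar heights, the bin edges, and the bar patches.
heights_small, edges_small, _ = plt.hist(small_sample, density=True, alpha=0.3)
heights_large, edges_large, _ = plt.hist(large_sample, density=True, alpha=0.3)
plt.show()

# With density=True, bar height times bar width sums to 1 for each sample.
area_small = np.sum(heights_small * np.diff(edges_small))
area_large = np.sum(heights_large * np.diff(edges_large))
print(area_small, area_large)
```

Because each histogram's total area is 1, the two samples can be compared on the same axes even though one has ten times as many values.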

#### Exercises

In [None]:
# Extra credit.
# I obtained the puppy weight data by executing the following command in
# the console:
#    print(puppies.to_csv(index=False))
# I copied the data into puppy_weight.csv.
puppies = pd.read_csv("puppy_weight.csv")
puppies.head()

In [None]:
# Create a histogram.
plt.hist(puppies["weight"], bins=50)
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
plt.show()

In [None]:
# Create a second histogram, zooming in on the range 5-35.
plt.hist(puppies["weight"], range=(5, 35))
plt.xlabel("Puppy weight (lbs)")
plt.ylabel("Number of puppies")
plt.show()

In [None]:
# Extra credit.
# I copied the data for the gravel DataFrame from the console, using this
# command:
#    print(gravel.to_csv(index=False))
# I copied the data into gravel.csv.
gravel = pd.read_csv("gravel.csv")
gravel

In [None]:
# Create a histogram of normalized data.
# The distribution is similar to the distribution from Shady Groves Campsite.
plt.hist(gravel["radius"], bins=40, range=(2, 8), density=True)
plt.xlabel('Gravel radius (mm)')
plt.ylabel('Frequency')
plt.title('Sample from Shoeprint')
plt.show()

### Recap of the Rescue