# Data Analysis with Python by IBM on Coursera - Notebook Cheatsheet

This notebook collects cheatsheets, revision notes, explanations, and exercises for the "Data Analysis with Python" course by IBM on Coursera. Each topic has its own section with a code example, and most sections end with an empty cell where you are encouraged to practice.

## Table of Contents

1. Collecting and Importing Data
2. Cleaning, Preparing & Formatting Data
3. Data Frame Manipulation
4. Summarizing Data
5. Building Machine Learning Regression Models
6. Model Refinement
7. Creating Data Pipelines

---

## 1. Collecting and Importing Data

### Cheatsheet & Explanation

- **Pandas** is the primary library used for data analysis and manipulation.
- Use `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()`, etc., to import data from various formats.
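
As a minimal sketch of the readers listed above, the snippet below parses the same two rows from CSV text and from JSON text. The in-memory strings and the column names `make` and `price` are made up for illustration, so that nothing has to exist on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory data so the example runs without any files.
csv_text = "make,price\naudi,13950\nbmw,16430\n"
json_text = '[{"make": "audi", "price": 13950}, {"make": "bmw", "price": 16430}]'

# The readers accept file paths, URLs, or file-like objects such as StringIO.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape, df_json.shape)
```

Whatever the source format, each reader returns a regular `DataFrame`, so the rest of the analysis does not change.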

### Code Example

```python
import pandas as pd

# Reading a CSV file
data = pd.read_csv('path/to/your/data.csv')

# Display the first 5 rows of the dataframe
print(data.head())
```

### Practice in the Next Cell



```python
import pandas as pd

# The UCI automobile file has no header row, so the column names are supplied explicitly.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
column_names = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style',
                'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight',
                'engine_type', 'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio',
                'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']

data = pd.read_csv(url, names=column_names)

# In a notebook, the last expression of a cell is displayed.
# Missing values appear as '?' in this file.
data
```


| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 201 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 202 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 173 | mpfi | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 203 | -1 | 95 | volvo | diesel | turbo | four | sedan | rwd | front | 109.1 | ... | 145 | idi | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |




---

## 2. Cleaning, Preparing & Formatting Data

### Cheatsheet & Explanation

- **Data Cleaning** involves handling missing values, removing duplicates, and correcting errors.
- **Data Preparation** includes type conversion and creating new columns.
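
The cleaning and preparation steps above can be sketched on a small made-up frame. The `'?'` placeholder mirrors the raw automobile file used in this notebook. The rows, the mean-fill strategy, and the `price_per_hp` column are assumptions chosen for illustration, not the course's prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; '?' marks a missing value, as in the raw automobile file.
raw = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw"],
    "horsepower": ["102", "?", "121", "121"],
    "price": ["13950", "17450", "?", "24565"],
})

# 1. Turn the '?' placeholder into a real missing value (NaN).
clean = raw.replace("?", np.nan)

# 2. Type conversion: the numeric columns arrived as strings.
clean["horsepower"] = pd.to_numeric(clean["horsepower"])
clean["price"] = pd.to_numeric(clean["price"])

# 3. Drop rows without a price, then fill remaining gaps with the column mean.
clean = clean.dropna(subset=["price"])
clean["horsepower"] = clean["horsepower"].fillna(clean["horsepower"].mean())

# 4. Create a new column from existing ones (hypothetical name).
clean["price_per_hp"] = clean["price"] / clean["horsepower"]

print(clean)
```

Whether to drop or fill a missing value depends on the column: dropping rows without a target value and filling gaps in a feature is one common choice, not the only one.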

### Code Example



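```python
# Handling missing values
# Note: this file marks missing entries with '?', which dropna() does not treat
# as missing, so rows containing '?' are kept.
data.dropna(inplace=True)  # Remove rows that contain NaN
length = len(data)
# print(length)

# Removing duplicates
data.drop_duplicates(inplace=True)
length = len(data)
# print(length)

data
```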

| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 201 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 202 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 173 | mpfi | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 203 | -1 | 95 | volvo | diesel | turbo | four | sedan | rwd | front | 109.1 | ... | 145 | idi | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |




---

## 3. Data Frame Manipulation

### Cheatsheet & Explanation

- Use `.loc[]` and `.iloc[]` for label-based and integer-based indexing, respectively.
- Column operations, like adding and deleting columns, are straightforward in pandas.
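
A small sketch of the indexing and column operations described above, using a made-up three-row frame. The row labels `a`, `b`, `c` and the `price_thousands` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame indexed by string labels.
cars = pd.DataFrame(
    {
        "make": ["audi", "bmw", "volvo"],
        "city_mpg": [24, 20, 23],
        "price": [13950, 24565, 16845],
    },
    index=["a", "b", "c"],
)

# .loc selects by label: row 'b', column 'price'.
bmw_price = cars.loc["b", "price"]

# .iloc selects by integer position: the first row, all columns.
first_row = cars.iloc[0]

# A boolean mask inside .loc filters rows by a condition.
efficient = cars.loc[cars["city_mpg"] > 21, ["make", "city_mpg"]]

# Add a derived column, then delete one (drop returns a new frame by default).
cars["price_thousands"] = cars["price"] / 1000
cars = cars.drop(columns=["price"])

print(cars)
```

Note that `.loc` slices include the end label, while `.iloc` slices follow Python's usual end-exclusive convention.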

### Code Example



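```python
# Selecting rows with a boolean mask inside .loc
# ('?' is this dataset's missing-value marker)
selected_rows = data.loc[data['normalized_losses'] == '?']

# Deleting a column (illustrative; commented out so the data stays intact)
# data.drop('symboling', axis=1, inplace=True)

# Adding a new column (illustrative; 'mpg_gap' is a hypothetical name)
# data['mpg_gap'] = data['highway_mpg'] - data['city_mpg']

selected_rows
```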

| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 5 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
| 7 | 1 | ? | audi | gas | std | four | wagon | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
| 9 | 0 | ? | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | ... | 131 | mpfi | 3.13 | 3.4 | 7.0 | 160 | 5500 | 16 | 22 | ? |
| 14 | 1 | ? | bmw | gas | std | four | sedan | rwd | front | 103.5 | ... | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121 | 4250 | 20 | 25 | 24565 |
| 15 | 0 | ? | bmw | gas | std | four | sedan | rwd | front | 103.5 | ... | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 30760 |
| 16 | 0 | ? | bmw | gas | std | two | sedan | rwd | front | 103.5 | ... | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
| 17 | 0 | ? | bmw | gas | std | four | sedan | rwd | front | 110.0 | ... | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 15 | 20 | 36880 |




---

## 4. Summarizing Data

### Cheatsheet & Explanation

- Use `.describe()` to get summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of a dataframe.
- `.groupby()` can be used for aggregating data based on one or more columns.
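
A brief sketch of both calls on a made-up frame. The rows are illustrative values, and `by_style` and `price_stats` are hypothetical variable names.

```python
import pandas as pd

# Hypothetical toy data with one categorical and two numeric columns.
cars = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "hatchback", "wagon"],
    "city_mpg": [24, 18, 19, 31, 23],
    "price": [13950, 17450, 16500, 6295, 16845],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = cars.describe()

# Mean of each numeric column per body style.
by_style = cars.groupby("body_style")[["city_mpg", "price"]].mean()

# Several aggregations of one column at once.
price_stats = cars.groupby("body_style")["price"].agg(["count", "min", "max"])

print(summary)
print(by_style)
print(price_stats)
```

Selecting the numeric columns before calling `.mean()` avoids errors from trying to average text columns.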

### Code Example



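```python
# Summary statistics for the numeric columns
print(data.describe())

# Grouping data: mean city mpg per body style
grouped_data = data.groupby('body_style')['city_mpg'].mean()
```

The output below was captured from a run on a different, 100-row dataset whose only numeric column was `Index`. On the automobile data, `describe()` reports one column of statistics per numeric field.

                Index
    count  100.000000
    mean    50.500000
    std     29.011492
    min      1.000000
    25%     25.750000
    50%     50.500000
    75%     75.250000
    max    100.000000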




---

## 5. Building Machine Learning Regression Models

### Cheatsheet & Explanation

- **Scikit-learn** is a popular library for building machine learning models.
- Split your dataset into training and testing sets to evaluate the performance of your model.
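
To make the split-then-evaluate idea concrete, here is a minimal sketch on synthetic data. The generated "engine size" and "price" values, the fixed `random_state`, and the choice of R² and mean squared error as metrics are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relation between one feature and the target.
rng = np.random.default_rng(0)
X = rng.uniform(60, 330, size=(200, 1))                    # hypothetical "engine size"
y = 150 * X[:, 0] - 5000 + rng.normal(0, 2000, size=200)   # hypothetical "price"

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score on the held-out rows only.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"test MSE: {mse:.1f}, test R^2: {r2:.3f}")
```

Scoring on the held-out rows estimates how the model behaves on unseen data, which a training-set score cannot do.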

### Code Example



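```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 'engine_size' and 'price' stand in for the original placeholder column names.
# 'price' contains '?' entries, so coerce to numbers and drop unusable rows first.
model_data = data[['engine_size', 'price']].apply(pd.to_numeric, errors='coerce').dropna()
X = model_data[['engine_size']]
y = model_data['price']

# Splitting the dataset (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Building the model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
predictions = model.predict(X_test)
```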



### Practice in the Next Cell



```python
# Try building a linear regression model with a dataset of your choice
```



---

## 6. Model Refinement

### Cheatsheet & Explanation

- Use cross-validation to assess the performance of your model more reliably.
- Hyperparameter tuning can be done using GridSearchCV or RandomizedSearchCV.
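
The sketch below illustrates both ideas on synthetic data. It swaps in `Ridge` as the estimator, because plain linear regression has no regularization strength to tune. The data, the `alpha` grid, and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data: three features with known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

# 5-fold cross-validation: one R^2 score per held-out fold.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print("fold scores:", scores.round(3))

# Grid search: cross-validate every candidate alpha and keep the best one.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best mean CV score:", round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of candidates, which can be cheaper when the grid is large.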

### Code Example



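```python
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Cross-validation on the training data (R^2 per fold for a regressor)
scores = cross_val_score(model, X_train, y_train, cv=5)

# Hyperparameter tuning: map each parameter name to the list of values to try.
# 'fit_intercept' is a LinearRegression parameter, used here in place of the placeholder.
parameters = {'fit_intercept': [True, False]}
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_train, y_train)
```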



### Practice in the Next Cell



```python
# Try refining your model by performing cross-validation and hyperparameter tuning
```



---

## 7. Creating Data Pipelines

### Cheatsheet & Explanation

- Data pipelines streamline the process of data transformation and model training.
- Use `Pipeline` from scikit-learn to create a sequence of data processing and model training steps.
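
As a sketch of how chained steps behave, the pipeline below scales one synthetic feature, expands it into polynomial terms, and fits a linear model in a single `fit` call. The data, the step names, and the polynomial degree are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relation between the feature and the target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=150)

# Each step is a (name, estimator) pair; every step but the last must transform.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

# fit() runs fit_transform on each transformer in order, then fits the model.
pipe.fit(X, y)
r2 = pipe.score(X, y)

# Fitted steps stay accessible by name.
scaler_mean = pipe.named_steps["scale"].mean_
print(f"training R^2: {r2:.3f}, scaler mean: {scaler_mean}")
```

Because the whole pipeline behaves like one estimator, it can be passed to `cross_val_score` or `GridSearchCV`, so the scaler is refit on each training fold rather than on all of the data.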

### Code Example



```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Creating a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Using the pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```



### Practice in the Next Cell



```python
# Try creating a data pipeline for a dataset and model of your choice
```



---

This notebook serves as a starting point for your journey in data analysis with Python. Remember, practice is key to mastering these concepts. Good luck!