
# Data Science Tools of the Trade
- This is a Google Colab notebook, Google's hosted version of a Jupyter notebook.
- Throughout this notebook, we'll illustrate some basic data science and machine learning concepts using a preloaded dataset.
- Anytime you see text surrounded by triple asterisks, `***LIKE THIS***`, that is a placeholder that you will need to replace with your own code.
- Have fun and good luck coding!

> To execute a line or block of code, simply click the "Play" button on its left side or use the keyboard shortcut "Shift + Enter".
> Once a code block has finished running, the empty brackets next to it will fill in with a number.

In [0]:
x = ***EDIT THIS CODE***
print(x)

# Importing Libraries
The huge number of Python libraries available makes it easy to accomplish all kinds of tasks! In this notebook, we'll be taking advantage of:
- [Pandas](https://pandas.pydata.org/) for data wrangling and statistical analysis
- [Matplotlib](https://matplotlib.org/) for visualizing (plotting) our data
- [Scikit-learn](https://scikit-learn.org/stable/) (imported as `sklearn`) for machine learning and predictive modeling

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

%matplotlib inline

# Importing Data
To "do" data science, we'll need some data to get started.
- Scikit-learn comes with several preloaded datasets that are **great** for practicing on.
- Pandas DataFrame objects make working with data straightforward, structured, and familiar.

In [0]:
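# Load the iris dataset that ships with scikit-learn
iris = datasets.load_iris()
# Combine the four measurement columns and the numeric target into one DataFrame
df = pd.concat([pd.DataFrame(iris.data, columns=iris.feature_names),
                pd.DataFrame(iris.target, columns=['species'])], axis=1)
# Replace the numeric class codes with readable species names
df['species'] = df['species'].map({0: 'setosa',
                                   1: 'versicolor',
                                   2: 'virginica'})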

In [0]:
df.head(5)
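
Beyond peeking at the first few rows, a few quick checks can confirm that the data loaded the way we expect. The sketch below rebuilds the same DataFrame so that it runs on its own; the particular checks (shape, class counts, missing values) are just one reasonable choice, not a required step.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame from the cell above so this sketch is self-contained.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa',
                                            1: 'versicolor',
                                            2: 'virginica'})

n_rows, n_cols = df.shape                      # (number of rows, number of columns)
species_counts = df['species'].value_counts()  # how many flowers of each species
n_missing = int(df.isna().sum().sum())         # total number of missing values

print(n_rows, n_cols)
print(species_counts)
print(n_missing)
```

If every species shows up the same number of times and nothing is missing, the dataset is balanced and clean, which is part of what makes it so nice for practice.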

# Examining One Column Visually
During the course of our analysis, we'll use visualizations to answer questions about our data and draw out different trends and patterns within it.
- A histogram is a quick way to visualize the distribution of a single variable, or column, from your dataset.
- Matplotlib has tons of ways to customize your chart to make it even more aesthetically appealing!

In [0]:
plt.hist(df['sepal length (cm)'], color='maroon',
         edgecolor='k', alpha=0.65)
plt.title('Distribution of Sepal Length')
plt.xlabel('Sepal Length in cm')
plt.ylabel('Number of Occurrences');
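
A single histogram hides which species each flower belongs to. As a rough sketch of one possible customization, the code below overlays one histogram per species on shared bins. The variable names (`column`, `value_range`, `counts_by_species`) are made up for this example, and the data is reloaded so the block runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this sketch is self-contained.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa',
                                            1: 'versicolor',
                                            2: 'virginica'})

column = 'sepal length (cm)'
# Use the same range for every species so the bins line up.
value_range = (df[column].min(), df[column].max())

fig, ax = plt.subplots()
counts_by_species = {}
for name, group in df.groupby('species'):
    # ax.hist returns the bin counts, the bin edges, and the drawn bars.
    counts, _, _ = ax.hist(group[column], bins=10, range=value_range,
                           alpha=0.65, edgecolor='k', label=name)
    counts_by_species[name] = counts

ax.set_title('Distribution of Sepal Length by Species')
ax.set_xlabel('Sepal Length in cm')
ax.set_ylabel('Number of Occurrences')
ax.legend()
```

Semi-transparent bars (`alpha=0.65`) keep overlapping regions visible, which makes it easier to see where the species differ.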

# Summary Stats
During the course of your analysis, you'll also use summary statistics to inspect and describe different aspects of your data.
- The mean of the `sepal length (cm)` column will be our measure of central tendency in this example.
- We'll also take a look at the standard deviation as our measure of spread.

In [0]:
df['sepal length (cm)'].mean()

In [0]:
df['sepal length (cm)'].std()
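
Pandas can also produce several summary statistics at once, and it can split them by group. Here is a minimal sketch, assuming the same DataFrame as above (rebuilt here so the block runs on its own). Note that pandas' `.std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this sketch is self-contained.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa',
                                            1: 'versicolor',
                                            2: 'virginica'})

column = 'sepal length (cm)'

# count, mean, std, min, quartiles, and max in a single call
summary = df[column].describe()
print(summary)

# The same mean and standard deviation, computed separately for each species
per_species = df.groupby('species')[column].agg(['mean', 'std'])
print(per_species)
```

Comparing the per-species means with the overall mean is a quick way to see whether a single column already helps tell the species apart.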

# Selecting Features and Output
We're just about ready for the predictive modeling phase of our project, but first we need to identify two things:
1. **Feature matrix**: this is the group of columns (features/variables/attributes) that will be used to determine our outcome.
2. **Target array**: this is the series (target/label/outcome) that contains the information that we're ultimately trying to predict.

So in this case, we're using the `'sepal length (cm)'`, `'sepal width (cm)'`, `'petal length (cm)'`, and `'petal width (cm)'` columns to predict which `species` each row (or flower) belongs to!

In [0]:
X = df[['sepal length (cm)','sepal width (cm)',
        'petal length (cm)','petal width (cm)']]
y = df['species']
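
A quick sanity check at this point is to confirm that the feature matrix is two-dimensional, the target array is one-dimensional, and both have the same number of rows. The sketch below rebuilds `df`, `X`, and `y` so that it runs on its own.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this sketch is self-contained.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa',
                                            1: 'versicolor',
                                            2: 'virginica'})

X = df[['sepal length (cm)', 'sepal width (cm)',
        'petal length (cm)', 'petal width (cm)']]
y = df['species']

print(type(X).__name__, X.shape)  # DataFrame: one row per flower, one column per feature
print(type(y).__name__, y.shape)  # Series: one label per flower
print(len(X) == len(y))           # every row of X needs exactly one label in y
```

Selecting with double brackets (`df[[...]]`) returns a DataFrame, while single brackets (`df['species']`) return a Series, which is why `X` and `y` end up with different shapes.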

# Running a Model
Now it's time to actually fit our predictive model. Since this is a **classification** problem (3 distinct possible outcomes), we'll use logistic regression, which (despite the name) is one of the first classification models that you should learn about.
- Scikit-learn ships with _tons_ of different models, so that's where we'll import ours from.
- We'll then "fit" the model with our `X` and `y` from above (our feature matrix and target array).
- During fitting, the model learns a weight for each feature (column) of `X` for each class, capturing how strongly that feature pushes a flower toward or away from each species in our `y`.

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
# A higher iteration cap than the default helps the solver converge on unscaled data
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
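
To make "what the model learned" a little more concrete, the fitted model can be inspected directly. This is only a sketch: the data is reloaded so the block runs on its own, and `max_iter=1000` is an assumption added here to give the solver plenty of iterations. The score shown is accuracy on the same data the model was trained on, so it is an optimistic estimate.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Rebuild the feature matrix and target array so this sketch is self-contained.
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target).map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# max_iter=1000 is an assumption for this sketch, not a setting from the notebook.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.classes_)      # the species labels the model can predict
print(model.coef_.shape)   # one row of weights per class, one column per feature
print(model.score(X, y))   # fraction of training rows classified correctly
```

Each row of `model.coef_` holds the weights for one species, lined up with the columns of `X`.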

# Making a Prediction
Now that our model has learned from the historical data, it's time to make predictions about new or unseen data points! So we'll ask it to make a prediction on a single `new_flower`.
- This is a flower with a sepal length of 5 cm, a sepal width of 5 cm, a petal length of 7 cm, and a petal width of 3 cm.
- When we execute the `model.predict` line, it will generate an output of one of the 3 species in our dataset.
- Try changing the values for our new flower to see how that prediction changes!

In [0]:
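# Reuse the training column names so the new flower matches the fitted features
new_flower = pd.DataFrame([[5, 5, 7, 3]], columns=X.columns)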
model.predict(new_flower)
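
Besides a single predicted label, a fitted logistic regression can report how confident it is in each class through `predict_proba`. A minimal sketch follows, reloading the data so it runs on its own; as before, `max_iter=1000` is an assumption added for this example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Rebuild and refit the model so this sketch is self-contained.
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target).map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption for this sketch
model.fit(X, y)

# Same measurements as the new flower above, labeled with the training column names.
new_flower = pd.DataFrame([[5, 5, 7, 3]], columns=X.columns)

predicted = model.predict(new_flower)[0]
probabilities = model.predict_proba(new_flower)[0]  # one probability per class

print(predicted)
for species, probability in zip(model.classes_, probabilities):
    print(f'{species}: {probability:.3f}')
```

The probabilities add up to 1, and the predicted label is the class with the largest one.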

# Welcome to the World of Data Science!
Congratulations on taking your first **HUGE** steps in your data science journey! Although there's still a lot to learn within each of the steps that we've outlined above, hopefully you're already starting to see the power of some of the tools that you'll have at your disposal in this field.

## Bonus Take Home Challenge
The best way to continue improving your proficiency is by taking on new concepts and practicing daily. See if you can implement any of the following tasks in the notebook above:

- Create a histogram of a different column or a scatter plot of two different columns.
- Generate some summary statistics for the same column(s) and write a short note interpreting what they mean.
- **Stretch goal**: use one other classification model from Scikit-learn on the data.

# Keep Learning with Thinkful
If you enjoyed today's session and want to take a deeper dive into topics like Pandas, SQL, predictive modeling, data visualization, and so much more, we'd love to have you join us again!
- Check out more of our webinars at [Thinkful Webinars](https://www.thinkful.com/webinars/)
- Learn more about the [Data Science Flex Course](https://www.thinkful.com/bootcamp/data-science/flexible/)