# GETTING STARTED WITH Python (Jupyter Notebook)

Alexander White

## Python (Jupyter Notebook):  HOW TO GET IT AND USE IT


Here are steps for you to get Python (Jupyter Notebook) installed:
- Go to [https://www.anaconda.com/download](https://www.anaconda.com/download)
- Run the Anaconda installer and follow the instructions
- See documentation at: [Anaconda Navigator](https://docs.anaconda.com/free/navigator/index.html) and [Jupyter documentation](https://jupyter.readthedocs.io/en/latest/) for further help getting Python/Jupyter up and running


## Learning objectives


- What are Jupyter notebooks, and what is their history?
- Learn how to start Jupyter Notebook and create a new notebook
- Explore the Jupyter notebook GUI
- Understand how to create a data frame using pandas
- Understand how to import data using pandas, calculate a new variable in a pandas data frame



NOTE: This document assumes that you have already successfully installed Jupyter notebook from the Anaconda distribution. If you have not done this, please follow those steps before proceeding, and if you have questions, seek out the Faculty or TA through email or office hours.



## What are Jupyter notebooks, and what is their history?
Jupyter notebooks provide a place where an analyst or programmer can intermingle code, the results of running code chunks (called cells), and text (including mathematical and statistical equations), using a variety of programming languages and a text syntax called Markdown. The name Jupyter refers to three of the languages it supports: Julia, Python, and R.

We will use the Jupyter Notebook application, a client-server application that lets you run Python code from your web browser. It is installed by default with the Anaconda Python distribution. Jupyter Notebook has two main components: the dashboard and the kernel. The kernel is the underlying programming language engine that executes your code; we will use Python 3.11 in this course. The dashboard is the client-side web page and interface, where you manage previously created notebooks and start and stop the kernels you would like to use. You may have multiple kernels active at any given time.

Python was conceived in the late 1980s and first released in 1991, but notebooks for Python didn't start being developed until the early 2000s, when IPython was created. Jupyter spun out of the IPython project in 2014 under the name Project Jupyter, and IPython is still the backend used by Jupyter notebook today when using a Python kernel. Notebooks for some languages have existed since the 1980s, but Jupyter notebook is a more recent phenomenon dating from around 2014.



## Exploring the Jupyter notebook GUI – Keyboard shortcuts
Details about the Jupyter notebook GUI and keyboard shortcuts are provided here.

* Command Mode (escape)
    + Insert cell above (a)
    + Insert cell below (b)
    + Change cell to markdown (m)
    + Change cell to code (y)
    + Change cell to raw (r)
* Edit Mode (enter)
    + Execute a cell (ctrl + enter)


For more information regarding keyboard shortcuts within Jupyter notebook, please see the following link: [JupyterLab Cheat Sheet](https://images.datacamp.com/image/upload/v1676302533/Marketing/Blog/Jupyterlab_Cheat_Sheet.pdf)



## Understand how to create a data frame using pandas
Details about creating data frames using pandas are provided here, including importing the necessary libraries and creating data frames from scratch.


In [None]:
import pandas as pd

myDF = pd.DataFrame({'name': ['Barry', 'Megan', 'Alec', 'Tim', 'Harriet', 'Lindsay'], 
                     'age': [12, 13, 15, 17, 19, 11]})
myDF

Select a column

In [None]:
myDF['age']

To select a specific value, just use another [] (remember that indices start at 0 in Python):

In [None]:
myDF['age'][0] ## first element

In [None]:
myDF['age'][:-1] ## all but last element
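
As an aside, pandas also offers the `.loc` (label-based) and `.iloc` (position-based) accessors, which select a row and a column in a single step. The following is a minimal sketch rather than the only way to do it; the name `people` is a hypothetical stand-in that rebuilds the same toy data so the cell runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the toy data frame used in this section
people = pd.DataFrame({'name': ['Barry', 'Megan', 'Alec', 'Tim', 'Harriet', 'Lindsay'],
                       'age': [12, 13, 15, 17, 19, 11]})

first_age = people.loc[0, 'age']         # label-based: row label 0, column 'age'
first_age_by_pos = people.iloc[0, 1]     # position-based: first row, second column
all_but_last = people['age'].iloc[:-1]   # every age except the last one

print(first_age, first_age_by_pos)
print(all_but_last.tolist())
```

With the default index, the row labels and row positions coincide, so `.loc[0, 'age']` and `.iloc[0, 1]` return the same value here.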

Add a new row (observation) to a pandas data frame

In [None]:
myDF2 = pd.DataFrame({'name': ["New", "People"], 'age': [13, 17]})
pd.concat([myDF, myDF2], ignore_index=True)

Notice that if ignore_index were not set to True, the indices would start back over at 0 where myDF2 is appended to myDF. You might try this in your own session to confirm that it is true.
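
The effect of `ignore_index` can be checked with a small sketch like the one below. The data frames `first` and `second` are hypothetical two-row examples, not part of the course data.

```python
import pandas as pd

# Two small, made-up data frames for illustration
first = pd.DataFrame({'name': ['Barry', 'Megan'], 'age': [12, 13]})
second = pd.DataFrame({'name': ['New', 'People'], 'age': [13, 17]})

kept = pd.concat([first, second])                            # original row labels are kept
renumbered = pd.concat([first, second], ignore_index=True)   # rows are relabeled 0..n-1

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate row labels are legal in pandas, but they make label-based selection ambiguous, which is one reason to renumber after appending.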

In [None]:
myDF ## notice that myDF was not modified


## Understand how to import data using pandas, calculate a new variable in a pandas data frame
Details about importing data using pandas and calculating new variables in pandas data frames are provided here.


In [None]:
myDF['newID'] = list(range(len(myDF)))  # one ID per row: 0, 1, ..., 5
myDF['gender'] = ['Male', 'Female', "Male", 'Male',
                 "Female", "Female"] ## notice two lines, and " vs '
myDF
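
New variables do not have to be typed in by hand; they can also be calculated from existing columns. The sketch below assumes a small made-up data frame called `people`, and the column names `age_in_months` and `is_teen` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Made-up data for illustration
people = pd.DataFrame({'name': ['Barry', 'Megan', 'Alec'], 'age': [12, 13, 15]})

# Arithmetic on a column is applied to every row at once
people['age_in_months'] = people['age'] * 12

# A comparison produces a True/False column
people['is_teen'] = people['age'] >= 13

print(people)
```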

## Wait.... I don't want to manually input my data here

In [None]:
income_statement = pd.read_csv("income_statement.csv")

In [None]:
income_statement

In [None]:
pd.read_csv?
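
If you do not have a CSV file at hand, `pd.read_csv` can be tried on text held in memory. This is only a sketch: the column names mirror those used later in this document, the numbers are invented, and the assumption that `Month` is stored as a number may not match the real `income_statement.csv`.

```python
import io
import pandas as pd

# Invented CSV content; the real income_statement.csv may look different
csv_text = """Year,Month,Revenues,Total Expenses,Net Profit
2022,1,1000.0,800.0,200.0
2022,2,1200.0,900.0,300.0
"""

sample = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts file-like objects as well as paths

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # the data type pandas inferred for each column
print(sample.head())   # the first rows
```

Checking `shape`, `dtypes`, and `head()` right after an import is a quick way to confirm that the file was read the way you expected.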

## Descriptive statistics

Let's "wrangle" the data a bit

In [None]:
basic_info = income_statement[['Revenues', 'Total Expenses', 'Net Profit']].agg(['min', 'max', 'mean'])
basic_info
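
To see what `agg` returns, it can help to run it on a tiny table whose answers are easy to check by hand. The sketch below uses a hypothetical data frame `toy` with invented numbers.

```python
import pandas as pd

# Invented numbers that are easy to verify by hand
toy = pd.DataFrame({'Revenues': [100.0, 300.0, 200.0],
                    'Net Profit': [10.0, 50.0, 30.0]})

# One row per statistic, one column per variable
summary = toy[['Revenues', 'Net Profit']].agg(['min', 'max', 'mean'])

print(summary)
print(summary.loc['mean', 'Revenues'])  # 200.0
```

The statistics become the row labels of the result, so a single value can be pulled out with `.loc[statistic, column]`.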

In [None]:
income_statement_by_year = income_statement.groupby('Year')[['Revenues', 'Total Expenses', 'Net Profit']].sum()
by_year = income_statement_by_year.applymap(lambda x: "${:,.2f}".format(x))
by_year

Here's what's happening:

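`"${:,.2f}"` is a string formatting pattern:

`$` adds the dollar sign.

`:,` adds a comma as the thousands separator.

`.2f` ensures there are exactly two decimal places.

`applymap()` applies the given function (which uses the string format) to each cell of the DataFrame, so every value in `by_year` becomes a string. In pandas 2.1 and later, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.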
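
The formatting pattern can be tried on a single number outside of pandas. This is a minimal sketch; the value of `amount` is arbitrary.

```python
amount = 1234567.891  # arbitrary example value

formatted = "${:,.2f}".format(amount)
print(formatted)  # $1,234,567.89

# An f-string accepts the same format specification
with_fstring = f"${amount:,.2f}"
print(with_fstring == formatted)  # True
```

Note that the result is text, not a number, so formatting is best done as the last step, after all calculations are finished.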

## Export
We want to send these to our client, so let's export them.

In [None]:
by_year.to_csv('income_by_year.csv')  # keep the index: after groupby('Year') it holds the Year labels
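
Because `groupby('Year')` moves `Year` into the row index, the `index` argument of `to_csv` decides whether the year labels end up in the file. The sketch below illustrates this with invented numbers and writes to in-memory buffers instead of real files; the names `toy` and `totals` are hypothetical.

```python
import io
import pandas as pd

# Invented data for illustration
toy = pd.DataFrame({'Year': [2022, 2022, 2023],
                    'Revenues': [100.0, 300.0, 200.0]})
totals = toy.groupby('Year')[['Revenues']].sum()  # Year becomes the row index

with_index = io.StringIO()
totals.to_csv(with_index)                   # Year is written as the first column

without_index = io.StringIO()
totals.to_csv(without_index, index=False)   # the Year labels are dropped

print(with_index.getvalue().splitlines()[0])     # Year,Revenues
print(without_index.getvalue().splitlines()[0])  # Revenues
```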

## Visualizations

In [None]:
import matplotlib.pyplot as plt
plt.plot(income_statement['Month'], income_statement['Revenues'], label='Revenues', color='blue')
plt.legend()  # This will show the 'Revenues' label on the plot.
plt.show()    # Display the plot

What's happening here? The Month values repeat every year, so the plot stacks all of the years on the same x-axis positions. OK, let's create a Date variable instead.

In [None]:
income_statement['Date'] = pd.to_datetime(income_statement['Year'].astype(str) + '-' + income_statement['Month'].astype(str) + '-01')

# Sort by this new 'Date' column
income_statement = income_statement.sort_values(by='Date')

In [None]:
income_statement
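
When the month is stored as a number, another option is to let `pd.to_datetime` assemble the date from columns named `year`, `month`, and `day`, which avoids string concatenation. This is a sketch under that assumption; the data frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Invented rows, deliberately out of order; Month is assumed to be numeric (1-12)
demo = pd.DataFrame({'Year': [2023, 2022, 2022], 'Month': [1, 2, 1]})

# to_datetime can build dates from columns named year, month, and day
parts = pd.DataFrame({'year': demo['Year'], 'month': demo['Month'], 'day': 1})
demo['Date'] = pd.to_datetime(parts)

# Sort chronologically and renumber the rows
demo = demo.sort_values(by='Date').reset_index(drop=True)
print(demo)
```

After sorting, the rows run from January 2022 through January 2023, which is the order a time series plot needs.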

In [None]:
plt.plot(income_statement['Date'], income_statement['Revenues'], label='Revenues', color='blue')
plt.legend()  # This will show the 'Revenues' label on the plot.
plt.show()    # Display the plot

In [None]:
avg_expenses = income_statement[['COGS', 'Salaries and Wages', 'Rent', 'Marketing and Advertising', 'Utilities', 'Depreciation', 'Other Expenses']].mean()
avg_expenses.plot(kind='bar', color='skyblue')

Plot Revenues, Total Expenses, and Net Profit over time as line graphs

In [None]:
# Revenues
plt.plot(income_statement['Date'], income_statement['Revenues'], label='Revenues', color='blue')

# Total Expenses
plt.plot(income_statement['Date'], income_statement['Total Expenses'], label='Total Expenses', color='red')

# Net Profit
plt.plot(income_statement['Date'], income_statement['Net Profit'], label='Net Profit', color='green')

plt.title('Income Statement Overview')
plt.xlabel('Date')
plt.ylabel('Amount')

# Add a legend to show labels
plt.legend()

# Display the plot
plt.show()

In [None]:
total_expenses = income_statement[['COGS', 'Salaries and Wages', 'Rent', 'Marketing and Advertising', 'Utilities', 'Depreciation', 'Other Expenses']].sum()
total_expenses.plot(kind='bar', color='lightcoral')



## Additional Resources
For more review and experience, I suggest the following tutorials:

- [Python tuple tutorial](https://www.datacamp.com/community/tutorials/python-tuples-tutorial)
- [Python list tutorial](https://www.datacamp.com/community/tutorials/18-most-common-python-list-questions-learn-python)
- [Python numpy tutorial](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
- [Python pandas tutorial](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
