# An Introduction to Data Analytics: Day 1
### The general steps to data analysis and modeling
 1. Problem definition
 1. Data extraction
 1. Data preparation - data cleaning
 1. Data preparation - data transformation
 1. Data exploration and visualization
 1. Predictive modeling
 1. Model validation/testing
 1. Visualization and interpretation of results
 1. Deployment of the solution (implementation of the solution in the real world)
 1. Documentation
 1. Monitoring
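
To make a few of the steps above concrete, here is a minimal sketch in plain Python. The data values, the `fit_line` helper, and the train/test split are all hypothetical and chosen only for illustration; a real project would typically rely on libraries such as pandas and scikit-learn rather than hand-written code.

```python
# Steps 1-2, problem definition and extraction.
# Hypothetical goal: predict y from x. These records are made up for illustration.
raw_records = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1)]

# Step 3, cleaning: drop records with a missing target value.
clean_records = [(x, y) for x, y in raw_records if y is not None]

# Step 5, exploration: a quick summary statistic.
mean_y = sum(y for _, y in clean_records) / len(clean_records)
print(f"{len(clean_records)} clean records, mean y = {mean_y:.2f}")


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_target = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_target) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_target - slope * mean_x


# Step 6, modeling: fit on the first four clean records.
train, test = clean_records[:4], clean_records[4:]
slope, intercept = fit_line(train)

# Step 7, validation: mean absolute error on the held-out record.
errors = [abs((slope * x + intercept) - y) for x, y in test]
mean_abs_error = sum(errors) / len(errors)
print(f"y = {slope:.2f} * x + {intercept:.2f}, held-out MAE = {mean_abs_error:.2f}")
```

Holding back data that the model never saw during fitting is what makes the validation step meaningful, even in a toy example like this one.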

### Key disciplines
 - Data Engineering
 - Data Science
 - Data Repository Management

### In many ways, data analytics is just a specialized form of software development.  Let's take a few minutes to discuss some fundamentals of software development:
 - Project Organization
   - How do you/should you organize your projects on your workstation?
   - How do you name your folders?  What sort of casing do you use?  Do you include spaces in your names?
   - Do you "insulate" one project from another?  Virtual environments?  Docker?
 - How do you organize the artifacts in a given project?
   - How do you name your folders and files?  What casing do you use?
   - How do you name your variables, functions, classes and other objects in your code?  [Pythonic](https://docs.python-guide.org/writing/style/)?
   - Do you comment your code?
   - Do you write unit tests?
   - Do you employ logging in your code?
 - Are you using source code control?
   - How do you branch your code?
   - Do you leverage artifacts like README.md and .gitignore?
 - How do you manage code deployments?
   - CI/CD?
 - Some References:
   - [Clean Code](https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882)
   - [A Philosophy of Software Design](https://www.amazon.com/dp/173210221X)
   - [Code Complete](https://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670)
   - [Design Patterns](https://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612)
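
Several of the questions above (naming, comments, logging, unit tests) can be illustrated in a few lines. This is only a sketch: the `celsius_to_fahrenheit` function and its test class are hypothetical examples, not part of this project.

```python
import logging
import unittest

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    fahrenheit = celsius * 9 / 5 + 32
    # Log instead of print so the message can be filtered or redirected later.
    logger.info("Converted %.1f C to %.1f F", celsius, fahrenheit)
    return fahrenheit


class CelsiusToFahrenheitTest(unittest.TestCase):
    """Unit tests named after the behavior they check."""

    def test_freezing_point(self):
        self.assertEqual(celsius_to_fahrenheit(0), 32)

    def test_boiling_point(self):
        self.assertEqual(celsius_to_fahrenheit(100), 212)


if __name__ == "__main__":
    # argv and exit=False keep this runnable inside a notebook cell as well.
    unittest.main(argv=["ignored"], exit=False)
```

Note the snake_case function name, the CamelCase class name, the docstrings, and the use of a module-level logger; these follow common Python conventions.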

### Some of the tools we use to perform data analytics:
 - Spark
 - Hadoop
 - SQL / Relational databases
 - Docker / Kubernetes
 - Python
 - Jupyter Notebook

### The rest of this training will focus on Python and Jupyter Notebook.

### Jupyter Notebook
 - A browser-based (HTML) development environment that can run code in many languages through language-specific back ends known as "kernels".  The IPython kernel, which runs Python, is the most popular; separate kernels can be registered for different Python versions or environments.
 - Jupyter Notebooks typically, but not always, run in a web browser.
 - The "next generation" notebook interface is called JupyterLab.
 - A notebook is composed of "cells" that hold either code or Markdown.
 - https://jupyter.org/
 - https://www.youtube.com/c/jupytercon
 - https://talkpython.fm/episodes/show/438/celebrating-jupyterlab-4-and-jupyter-7-releases
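
To see what "cells" look like under the hood: a notebook file (`.ipynb`) is a JSON document. The sketch below builds a minimal two-cell notebook structure by hand, assuming the version 4 notebook format; the cell contents are made up for illustration, and in practice Jupyter writes this file for you.

```python
import json

# A minimal notebook: one Markdown cell and one code cell (hypothetical contents).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A heading\n", "Some *Markdown* text."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not yet executed
            "metadata": {},
            "outputs": [],
            "source": ["print('Hello from a code cell')"],
        },
    ],
}

# Serialize to the text that would be stored in the .ipynb file, then read it back.
notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)
```

Because the file is plain JSON, notebooks can be stored in source control, although the embedded outputs tend to make diffs noisy.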

### Markdown
 - A powerful way to document your notebooks
 - Markdown cells honor Markdown syntax, HTML, and LaTeX.
 - https://www.markdownguide.org/

$E_0 = mc^2$

$\int_0^1 \frac{dx}{e^x} =  \frac{e-1}{e}$
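
As a small worked example of LaTeX in a Markdown cell, the source below should render the second formula above as a step-by-step derivation. It also confirms the stated result, since the antiderivative of $e^{-x}$ is $-e^{-x}$.

```latex
$$
\int_0^1 \frac{dx}{e^x}
  = \int_0^1 e^{-x}\,dx
  = \left[-e^{-x}\right]_0^1
  = 1 - e^{-1}
  = \frac{e-1}{e}
$$
```

Single dollar signs produce inline math, while double dollar signs produce a centered display equation.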


### Notebook "modes", keyboard shortcuts, magic commands, and more
 - Edit mode and Command mode
 - [Several keyboard shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) and menu options
 - [Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html)
 - Shell commands (lines prefixed with `!`)

### Some helpful examples
 - [A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki)
 - [How to Use Jupyter Notebooks: The Ultimate Guide](https://www.datacamp.com/tutorial/tutorial-jupyter-notebook)


In [2]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %code_wrap  %colors  %conda  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%code_wrap  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  

The `%%time` cell magic is something I use a lot; it reports how long the whole cell took to run:

In [4]:
%%time
import time

print('Starting a long run...')
for i in range(10):
    time.sleep(1)

print('Completed long run.')

Starting a long run...
Completed long run.
CPU times: total: 0 ns
Wall time: 10.1 s
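
The output above reports almost no CPU time but about ten seconds of wall time. That is expected: `time.sleep` waits without doing any computation. The sketch below takes roughly the same two kinds of measurement in plain Python using the standard `time` module, with a shorter sleep; it is an approximation of what `%%time` reports, not its actual implementation.

```python
import time

wall_start = time.perf_counter()  # wall-clock timer
cpu_start = time.process_time()   # CPU time consumed by this process

time.sleep(0.2)  # sleeping waits without using the CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"Wall time: {wall_elapsed:.2f} s")
print(f"CPU time:  {cpu_elapsed:.2f} s")
```

A large gap between wall time and CPU time usually means the code is waiting on something (sleep, disk, network) rather than computing.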


Shell commands in your notebooks can be helpful, too:

In [8]:
!python --version

Python 3.10.13
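
One caveat: `!python --version` runs whichever `python` is found first on the shell's PATH, which is not necessarily the interpreter behind the notebook kernel. A minimal way to check the kernel's own interpreter from Python, using only the standard `sys` module:

```python
import sys

# Version of the interpreter that is executing this code
# (the kernel's interpreter when run inside a notebook).
version = ".".join(str(part) for part in sys.version_info[:3])
print(f"Python {version}")

# Full path to that interpreter; useful for confirming which environment is active.
print(sys.executable)
```

If the two versions disagree, the notebook kernel and the shell are probably using different environments.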


A code example:

In [1]:
# Anything a code cell prints appears in the output area directly beneath it
for i in range(5):
    print(f'This is loop {i}')

This is loop 0
This is loop 1
This is loop 2
This is loop 3
This is loop 4


### Let's review the instructions in README.md

### Homework
Go through the instructions in the README.md file.  Try to install Anaconda and VS Code on your workstation.  Clone this project from GitHub and try to create a virtual environment for it.