# Learning Data Science

This notebook is my first attempt at learning data science with Python. It covers databases, Python, and frameworks like SQLAlchemy, pandas, NumPy, and scikit-learn.

## Learning materials

In this document, I am working through the examples provided in the book, *[Practical Data Science with Jupyter](https://nokia.percipio.com/books/f016a0a0-91bd-4ebf-852c-54f929f9f446)* by Prateek Gupta, published in 2021. The book is available in [Skillsoft Percipio](https://nokia.percipio.com/) so it is free for Nokians to read. It is concisely written and serves as a good book for beginners who want to learn how to use Python-related tools in data science.  

## Set up your environment

Use JupyterLab to develop and present your data science projects. To use JupyterLab, you must:

* Install Python on your Windows laptop
* Create a new folder for your data science learning projects
* Create a Python virtual environment and activate it
* Install JupyterLab in your virtual environment and run JupyterLab

### Install Python on your Windows laptop

There are many ways to [install Python on Windows](https://learn.microsoft.com/en-us/windows/python/beginners#install-python). I recommend installing it from the *Microsoft Store*. Open the Microsoft Store app and search for Python. Install the latest version. [Python 3.11](https://apps.microsoft.com/store/detail/python-311/9NRWMJP3717K?hl=en-ca&gl=ca&activetab=pivot%3Aoverviewtab&rtc=1) was the latest version at the time I wrote this notebook so all examples use Python 3.11.

### Create a new folder

Open the [Windows Terminal](https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=en-us&gl=us) app.

Create a new folder that will store your virtual environment, notebook files, and other files you create while learning about data science. Navigate to your home folder, or to a subdirectory of your choice and create a folder named *data-science*:

```powershell
> mkdir data-science
```

### Create a Python virtual environment

Navigate into the folder you created and create a Python virtual environment. I chose to call mine *env*.

```powershell
> cd data-science
> python -m venv env
```

Next, activate the virtual environment.

```powershell
> .\env\Scripts\activate
(env) > 
```
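
To confirm that the activation worked, you can ask Python itself. The following is a minimal sketch and is not taken from the book; the function name `in_virtual_environment` is my own invention. It relies on the fact that, inside an environment created by `venv`, `sys.prefix` points at the environment folder while `sys.base_prefix` points at the base Python installation.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter is running inside a venv.

    The optional arguments exist only so the check can be tried with
    made-up paths; by default the real interpreter values are used.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Virtual environment active:", in_virtual_environment())
    print("Environment folder:", sys.prefix)
```

Save this as a small script, or paste it into the Python prompt, and run it from the terminal where you activated the environment. If it reports that no virtual environment is active, run the activate command again before installing any packages.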

### Install and run JupyterLab

Install JupyterLab in the virtual environment. JupyterLab installs a lot of modules so you should keep it in a virtual environment to avoid messing up your Windows system.

```powershell
(env) > pip install jupyterlab
```

Start the Jupyter server. The following command opens the classic Notebook interface.

```powershell
(env) > jupyter notebook
```

The terminal will show multiple URLs that you can copy and paste into a web browser. Because the server is running on your own laptop, use one of the *localhost* URLs, like the following:

```
http://localhost:8888/?token=678b6891879b80fc02488701d553b1a2b4
```

The token is needed the first time you use a new browser to access JupyterLab. After using it once, it is cached in the browser and you can connect to JupyterLab in the future with just the simple URL: 

```
http://localhost:8888
```
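
The two URLs above differ only in the `token` query parameter. As a small illustration, assuming nothing beyond Python's standard library, the sketch below splits a Jupyter URL into its base address and its token. The helper name `split_jupyter_url` is hypothetical and is not part of Jupyter.

```python
from urllib.parse import parse_qs, urlparse


def split_jupyter_url(url):
    """Return (base_url, token) for a Jupyter server URL.

    The token is None when the URL has no token query parameter.
    """
    parts = urlparse(url)
    base_url = f"{parts.scheme}://{parts.netloc}"
    token_values = parse_qs(parts.query).get("token", [])
    token = token_values[0] if token_values else None
    return base_url, token


if __name__ == "__main__":
    example = "http://localhost:8888/?token=678b6891879b80fc02488701d553b1a2b4"
    base_url, token = split_jupyter_url(example)
    print("Base URL:", base_url)
    print("Token:", token)
```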

The Jupyter Notebook web interface looks like the image below.

![Jupyter Notebook user interface](./Images/JupyterNotebook001.png)

### Alternative interfaces

#### JupyterLab

You can also use the JupyterLab interface. To open notebooks in JupyterLab, start the Jupyter server with the following command instead of the `jupyter notebook` command:

```powershell
(env) > jupyter-lab
```

JupyterLab is the newer interface and has more features but the original Notebook interface is all most people need.

#### VS Code

You can also [edit Jupyter Notebooks in the VS Code editor](https://code.visualstudio.com/docs/datascience/jupyter-notebooks). Install the VS Code Python extension and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter), then open the notebook file in VS Code. The virtual environment must already be activated and the JupyterLab server must already be started.

## Get Data

To learn about the basics of data science, you need data. Eventually, you need to learn how to work with many sources of data, such as:

* Databases, mostly SQL
* Excel and CSV files stored on a secure SharePoint site
* APIs of external services
* Web scraping

Most of the data you will access in the Analytics team will be from a database or a file on a SharePoint site, and most of our data comes from the HRDP data lake. So, instead of starting with simpler examples featuring CSV files, for which there are already [many](https://alongrandomwalk.com/2020/09/14/read-and-write-files-with-jupyter-notebooks/) [tutorials](https://www.digitalocean.com/community/tutorials/data-analysis-and-visualization-with-pandas-and-jupyter-notebook-in-python-3) [available](https://www.datacamp.com/tutorial/python-excel-tutorial), you should first learn how to [access data from a database](https://realpython.com/tutorials/databases/).

### SQLAlchemy

[SQLAlchemy](https://www.sqlalchemy.org/) is a Python package that helps programmers interact with SQL databases without having to learn the SQL query language. SQLAlchemy has a learning curve of its own but, if you use it, you do not need to worry about learning the [SQL language differences](https://towardsdatascience.com/how-to-find-your-way-through-the-different-types-of-sql-26e3d3c20aab) between the various SQL databases, like MySQL, PostgreSQL, and Microsoft SQL Server, that different data sources may use.

In your virtual environment, install SQLAlchemy with the following command:

```powershell
(env) > pip install SQLAlchemy
```

Once you start using databases in Python, you start learning more about advanced Python features such as [object-oriented programming](https://www.freecodecamp.org/news/object-oriented-programming-in-python/), [data classes](https://docs.python.org/3.11/library/dataclasses.html) and [type hints](https://towardsdatascience.com/type-hints-in-python-everything-you-need-to-know-in-5-minutes-24e0bad06d0b). I leave these topics to your own study.

### Data sources

I wish we could start experimenting with Python and SQLAlchemy using an already-existing public SQL database but, for good reasons, SQL databases are not made available directly on the Internet. Instead, their data is accessed via APIs or simply downloaded as CSV files. Many such [datasets are](https://learnsql.com/blog/free-online-datasets-to-practice-sql/) [available to the public](https://www.dropbase.io/post/top-11-open-and-public-data-sources#:~:text=Top%2011%20Open%20and%20Public%20Data%20Sources%20for,News%20...%208%208.%20NASA%20...%20More%20items), but they don't meet our need to learn how to directly access an SQL database from a Python program.

Another option is to create a database on the public SQL server at [db4free.net](https://www.db4free.net/). However, you would need to build a sample database from scratch, and I want to avoid learning SQL at this early stage. I want to access the database via Python using SQLAlchemy. Still, this is an option if you want to take the time to learn the basics of SQL.

You may also install a desktop database application like [Microsoft Access](https://www.microsoft.com/en-us/microsoft-365/access) on your laptop and then download a sample database from a public repository, like the [Microsoft Northwind SQL Sample database](https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/linq/downloading-sample-databases#northwind_access). I will try that.

### Microsoft Access

You already have access to Microsoft Access via Nokia's Office 365 corporate license. Install Access if you do not already have it.

Start Microsoft Access. Then, get the [Northwind sample database for Microsoft Access](https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/linq/downloading-sample-databases#northwind_access). 

Click on the *More templates* link on the Access screen. Search for *Northwind* in the *Search for Online Templates* field. The Northwind database should appear, as shown below:

![Find online Northwind database file](./Images/access001.png)

Select the Northwind database to download it. You will see an information screen like the one below. Change the filename to *Northwind* and select the folder to which it will be downloaded. Then click the *Create* button.

![Download and save the database](./Images/access002.png)

Microsoft Access will display its view of the database with a welcome screen, as shown below. We are not interested in using the MS Access interface. We want to connect to this database using SQLAlchemy.

### SQLAlchemy Access dialect

SQLAlchemy needs a *dialect* installed so you can connect it to Microsoft Access. Install the [sqlalchemy-access](https://pypi.org/project/sqlalchemy-access/) Python package into your virtual environment.

```powershell
(env) > pip install sqlalchemy-access
```
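
Before trying to connect, it can help to check that the packages are visible from the environment the notebook is running in. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the import names `sqlalchemy`, `pyodbc`, and `sqlalchemy_access` are my assumptions about how the installed packages are imported.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages used in this notebook.
    required = ["sqlalchemy", "pyodbc", "sqlalchemy_access"]
    missing = missing_packages(required)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If a package is reported as missing even though you installed it, the notebook is probably running with a different Python interpreter than the one in your virtual environment.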

```python
from sqlalchemy import create_engine, URL
import pyodbc

# Earlier connection attempts, left commented out:
# engine = create_engine(r'access+pyodbc:///?DataSource=C:/Users/blinklet/Documents/Northwind_be.accdb')
# engine = create_engine(r'access+pyodbc:///@Northwind?driver=MS+Access+Database?TrustedConnection=yes')
# url_object = URL.create(
#     "access+pyodbc",
#     host = "localhost",
#     database = "@northwind"
# )
# engine = create_engine(url_object)

engine = create_engine(r'access+pyodbc://@localhost:3309/northwind')
```

```python
from sqlalchemy import inspect

print(engine)
inspector = inspect(engine)

# Print the name of every column in every table
for table_name in inspector.get_table_names():
    for column in inspector.get_columns(table_name):
        print("Column: %s" % column['name'])
```

```
Engine(access+pyodbc://@localhost:3309/northwind)

InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
(Background on this error at: https://sqlalche.me/e/20/rvf5)
```
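
The `IM002` error usually means that the ODBC Driver Manager could not match the connection request to an installed driver or a configured data source name. That fits the URL used above: Microsoft Access is a file-based database, so there is no server listening on `localhost:3309` to connect to. One likely fix, which I have not yet verified against the Northwind file, is to pass a raw ODBC connection string to pyodbc through the `odbc_connect` query parameter. The sketch below makes several assumptions: the function names `build_access_url` and `list_tables` are hypothetical, the driver name is the one the Microsoft Access Database Engine normally registers, and the database path is a placeholder you must supply yourself.

```python
from urllib.parse import quote_plus

# Assumed driver name; check the ODBC Data Sources tool if yours differs.
ACCESS_DRIVER = "Microsoft Access Driver (*.mdb, *.accdb)"


def build_access_url(db_path, driver=ACCESS_DRIVER):
    """Build an SQLAlchemy URL that hands a raw ODBC string to pyodbc."""
    odbc_string = f"DRIVER={{{driver}}};DBQ={db_path};"
    # The ODBC string must be URL-encoded before it goes into the query.
    return "access+pyodbc:///?odbc_connect=" + quote_plus(odbc_string)


def list_tables(db_path):
    """Connect to the Access file and return its table names.

    Requires SQLAlchemy, sqlalchemy-access, pyodbc, and an installed
    Access ODBC driver, so the imports happen only when it is called.
    """
    from sqlalchemy import create_engine, inspect

    engine = create_engine(build_access_url(db_path))
    return inspect(engine).get_table_names()
```

In a notebook cell, you would call `list_tables()` with the full path to the *Northwind* file you saved earlier. If the same `IM002` error appears, the Access ODBC driver may not be installed, or its 32-bit or 64-bit build may not match your Python installation.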