<a href="https://colab.research.google.com/github/apmon26/Python/blob/main/INFO212_Week1_1_Lecture_python_jupytper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 212: Data Science Programming 1
## CCI at Drexel University
### Yuan An, PhD
### Associate Professor

## Week 1-Lecture 1: Setup and Review: Python and Jupyter Notebooks
---
## **What should you learn from today's lecture?**
- Use Colab notebook
- Use Colab AI
- import conventions
- Markdown instructions
- Define a function: def fun()
- Variables, data types
- Strings
- if..elif..else
- for loop
- None type
- Open and save files from Google drive
- try...except
---


## Open Colab Notebook
- There are two types of cells: code and text.

## Set up Directories
I recommend that you put all your notebooks and datasets in a directory called "info212" in your Google Drive.
- Mount your drive to your notebook.
- Navigate the directory from your notebook.

### Exercise:
- Download this notebook to "info212". Upload and open it in Colab.
- Download the datasets from the course shell. Print out the path of the file 'iris.csv'.

### Exercise:
- In a code cell, write a function to compute the square root of a number. Test the function on -2, 3, 4 and print out the results.

### Text Cell
- In a text cell, you can describe your work. It is essential to describe your data analysis activities in sufficient detail, just as code comments are essential when coding.
- Markdown is a simple markup language for styling text. It is especially useful when the notebook environment has no styling menus.

### Exercise:
- Create a text cell and add a public picture linked here: https://i.imgur.com/oGHBQwb.png

## Use Code Assistants
- AI tools are rich learning resources.
- Use AI tools to get the right syntax
- Use AI tools to fix bugs
- **However, you must learn the fundamentals to use AI tools effectively and correctly.**

### Exercise:
- Ask the AI tool to write a function to calculate the average of a list of numbers and test it on an empty list.

## What Does Data Science Programming Do?
- Write programs to scrape data from sources.
- Write programs to manipulate data.
- Store data in desired structures and formats.
- Write programs to answer meaningful questions involving complicated operations such as comparison, distillation, aggregation, and visualization.

## The primary focus of analysis is on structured data, such as
* Multidimensional arrays (matrices)
* Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files
* Multiple tables of data interrelated by key columns
* Evenly or unevenly spaced time series

## Essential Python Libraries

The following is a list of essential python libraries in the scientific Python ecosystem:

### NumPy
NumPy, short for Numerical Python, is the foundational package for scientific computing
in Python. It provides, among other things:
* A fast and efficient multidimensional array object ndarray
* Functions for performing element-wise computations with arrays or mathematical operations between arrays
* Tools for reading and writing array-based data sets to disk
* Linear algebra operations, Fourier transform, and random number generation
* Tools for connecting C, C++, and Fortran code to Python

### Pandas
pandas provides rich data structures and functions designed to make working with
structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment.
The primary object in pandas that will be used in this book is the DataFrame, a two-dimensional tabular, column-oriented data structure with both row and column labels.

### matplotlib
matplotlib is the most popular Python library for producing plots and other 2D data
visualizations.

### Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

### SciPy
SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing.

### Scikit-Learn
Scikit-learn is a machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.

## Import Conventions
The Python community has adopted a number of naming conventions for commonly used
modules:

## Import libraries

```
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
```

```
# Print the version of the packages
print(matplotlib.__version__)
```

## Access External Data Sets
- Your data should be stored in Google Drive.
- For a small data set, you can upload it directly to the notebook (which runs in a virtual machine):

`from google.colab import files`

`files.upload()`

- For large data sets, mount the drive and use the paths
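As a rough sketch of the mount-and-path workflow: the code below assumes Drive is mounted at `/content/drive`, that your files appear under `MyDrive`, and that the course folder is named "info212". The helper names `mount_drive_if_in_colab` and `dataset_path` are made up for this illustration.

```python
from pathlib import Path

# Assumption: Drive is mounted at /content/drive and files live under MyDrive.
DRIVE_ROOT = Path("/content/drive/MyDrive")


def mount_drive_if_in_colab():
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount("/content/drive")
    return True


def dataset_path(filename, folder="info212"):
    """Build the path of a file stored in the (assumed) course folder."""
    return DRIVE_ROOT / folder / filename


print(dataset_path("iris.csv"))
```

In a Colab notebook you would call `mount_drive_if_in_colab()` once, approve the authorization prompt, and then pass the resulting path to `open()` or a data-loading function.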

### Exercise:
- Open the file "iris.csv" in Google Drive and print out its content.

# Introduction to Data Analysis Tasks

In this course we will learn the Python tools to work productively with data. The tasks required generally fall into a number of different broad groups:
* **_Interacting with the outside world_**
    - Reading and writing with a variety of file formats and databases.
    
* **_Preparation_**
    - Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
    
* **_Transformation_**
    - Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
    
* **_Modeling and computation_**
    - Connecting your data to statistical models, machine learning algorithms, or other computational tools.
    
* **_Presentation_**
    - Creating interactive or static graphical visualizations or textual summaries.


# **Although today's AI tools are powerful enough to carry out these tasks automatically, high-level human intelligence and intervention remain essential in data analysis.**

# What should a data scientist be able to do?
- Master the programming fundamentals
- Understand the business needs and ask meaningful questions


### An Example for Describing Data and Asking Meaningful Questions

- Kaggle dataset: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
- Problem Statement
    - Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.



Attributes about Customers:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Attributes about Products:
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years


Attributes about Promotion:
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Attributes about Place:
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### Questions
- Retrieve the three youngest married customers and the three youngest single customers who accepted offers in fewer than three promotion campaigns.
- What is the average number of web purchases for customers who accepted offers in different campaigns?
- Explore the relationship between customers' spending on wine and on fruit; categorize customers by different attributes.

# Python Reviews

#### Comments
Any text preceded by the hash mark (pound sign) # is ignored by the Python interpreter.
This is often used to add comments to code. At times you may also want to
exclude certain blocks of code without deleting them. An easy solution is to comment
out the code.
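As a small illustration (the variable names are made up for this sketch), the snippet below shows a full-line comment, a trailing comment, and a commented-out statement:

```python
# A full-line comment: compute the area of a rectangle.
width = 3
height = 4
area = width * height  # a trailing comment after a statement

# The next line is commented out, so the interpreter skips it:
# area = area * 2

print(area)  # 12
```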

#### Functions
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.

In Python, a function is defined using the def keyword. To call a function, use the function name followed by parentheses. Information can be passed into functions as arguments.

```
# Example
def my_function(name):
  print("Hello " + name + " from a function")

my_function("John")
```
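The example above only prints a message. As a minimal sketch of a function that returns a result to the caller (the name `to_fahrenheit` is made up for illustration):

```python
def to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit and return it."""
    return celsius * 9 / 5 + 32


boiling = to_fahrenheit(100)  # the returned value can be stored and reused
print(boiling)                # 212.0
```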

#### Variables and argument passing

When assigning a variable (or name) in Python, you are creating a reference to the
object on the right-hand side of the equals sign.

When you pass objects as arguments to a function, new local variables are created referencing
the original objects without any copying. If you bind a new object to a variable
inside a function, that change will not be reflected in the parent scope. It is
therefore possible to alter the internals of a mutable argument. Suppose we had the
following function:
```python
def append_element(some_list, element):
    some_list.append(element)
```

In practical terms, consider a list of integers. After the call, the original list contains the new element:

```
data = [1, 2, 3]

append_element(data, 4)

data
```
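To illustrate the difference between rebinding a name inside a function and mutating the object it refers to, here is a small sketch. The function names `rebind` and `mutate` are hypothetical and exist only for this example.

```python
def rebind(some_list):
    # Binds the local name to a brand-new list; the caller's list is untouched.
    some_list = [0, 0, 0]
    return some_list


def mutate(some_list):
    # Changes the object that both the caller and the function refer to.
    some_list.append(99)


numbers = [1, 2, 3]
alias = numbers          # a second reference to the same list, not a copy

rebind(numbers)
print(numbers)           # [1, 2, 3]

mutate(numbers)
print(numbers)           # [1, 2, 3, 99]
print(alias is numbers)  # True
```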

#### Imports
In Python a module is simply a file with the .py extension containing Python code.
Suppose that we had the following module:

```python
# some_module.py
PI = 3.14159

def f(x):
    return x + 2

def g(a, b):
    return a + b
```

If we wanted to access the variables and functions defined in some_module.py from
another file in the same directory, we could do:
```
import some_module
result = some_module.f(5)
pi = some_module.PI
```

Or equivalently:

```
from some_module import f, g, PI
result = g(5, PI)
```

By using the as keyword, you can give imports different names:

```
import some_module as sm
from some_module import PI as pi, g as gf

r1 = sm.f(pi)
r2 = gf(6, pi)
```

#### Mutable and immutable objects
Most objects in Python, such as lists, dicts, NumPy arrays, and most user-defined
types (classes), are mutable. This means that the object or values that they contain can
be modified.

Others, such as strings and tuples, are immutable: once created, their contents cannot be changed.

In Python, dictionary keys must be immutable (more precisely, hashable), so tuples of immutable values can be used as dictionary keys.
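A short sketch of these rules in action; the names `a_list`, `a_tuple`, and `grid` are illustrative only:

```python
a_list = ["foo", 2, [4, 5]]
a_list[2] = (3, 4)            # lists are mutable: replace an element in place
print(a_list)                 # ['foo', 2, (3, 4)]

a_tuple = (3, 5, (4, 5))
try:
    a_tuple[1] = "four"       # tuples are immutable, so this raises TypeError
except TypeError as err:
    print("TypeError:", err)

# A tuple of immutable values can be a dictionary key.
grid = {(0, 0): "origin"}
grid[(1, 2)] = "point A"
print(grid[(1, 2)])           # point A

# A list cannot be a key, because lists are mutable and therefore unhashable.
try:
    grid[[3, 4]] = "point B"
    list_key_failed = False
except TypeError:
    list_key_failed = True
print(list_key_failed)        # True
```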

---



# Python Data Types

### Scalar Types
Python along with its standard library has a small set of built-in types for handling
numerical data, strings, boolean (True or False) values, and dates and time. These
“single value” types are sometimes called scalar types and we refer to them in this
book as scalars.

#### Numeric types
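As a brief sketch of the two primary numeric types, int and float (the variable names are illustrative):

```python
ival = 17239871
big = ival ** 6           # ints have arbitrary precision, so this does not overflow
fval = 7.243
small = 6.78e-5           # floats can be written in scientific notation

half = 3 / 2              # true division always produces a float
floor_half = 3 // 2       # floor division drops the fractional part

print(type(ival), type(fval))
print(half, floor_half)   # 1.5 1
```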

#### Strings
Many people use Python for its powerful and flexible built-in string processing capabilities.
You can write string literals using either single quotes ' or double quotes ":

```
a = 'one way of writing a string'
b = "another way"
```

The syntax s[:3] is called slicing and is implemented for many kinds of Python
sequences.

The backslash character \ is an escape character, meaning that it is used to specify
special characters like newline \n or Unicode characters.
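A minimal sketch of slicing and escape characters, using made-up example strings:

```python
s = "python"
print(s[:3])       # pyt
print(s[2:5])      # tho
print(len(s))      # 6

# A doubled backslash in a normal string literal is one literal backslash.
path = "12\\34"
print(path)        # 12\34

# A raw string (r prefix) treats backslashes as ordinary characters.
raw = r"this\has\no\special\characters"
print(raw)

# \n inside a normal string starts a new line when printed.
two_lines = "first\nsecond"
print(two_lines)
```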

#### Booleans
The two boolean values in Python are written as True and False. Comparisons and
other conditional expressions evaluate to either True or False. Boolean values are
combined with the and and or keywords:
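For instance, a few boolean expressions (the values of `a`, `b`, and `c` are arbitrary):

```python
print(True and True)     # True
print(False or True)     # True
print(not False)         # True

a, b, c = 5, 7, 8
in_order = a < b and b < c   # both comparisons must hold
either = a > b or b < c      # at least one comparison must hold
print(in_order, either)      # True True
```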

#### Type casting
The str, bool, int, and float types are also functions that can be used to cast values
to those types:
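A small sketch of casting between these types, starting from an example string:

```python
s = "3.14159"
fval = float(s)          # string -> float
print(type(fval))        # <class 'float'>
print(int(fval))         # 3  (int() truncates toward zero)
print(bool(fval))        # True  (any nonzero number is truthy)
print(bool(0))           # False
print(str(42) + "!")     # 42!
```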

#### None
None is the Python null value type. If a function does not explicitly return a value, it
implicitly returns None. None is also a common default value for function arguments:

```
def add_and_maybe_multiply(a, b, c=None):
    result = a + b

    if c is not None:
        result = result * c

    return result
```
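To see both behaviors, here is a short sketch with two hypothetical functions: `greet` has no return statement, and `scale` uses None as a default argument.

```python
def greet(name):
    print("Hello", name)   # no return statement


result = greet("Ada")
print(result is None)      # True: the function implicitly returned None


def scale(value, factor=None):
    # Use `is None` to test for None rather than ==.
    if factor is None:
        return value
    return value * factor


print(scale(10))           # 10
print(scale(10, 3))        # 30
```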

#### Dates and times
The built-in Python datetime module provides datetime, date, and time types. The
datetime type, as you may imagine, combines the information stored in date and
time and is the most commonly used:

```
from datetime import datetime, date, time
dt = datetime(2011, 10, 29, 20, 30, 21)
dt.day     # 29
dt.minute  # 30
```

Given a datetime instance, you can extract the equivalent date and time objects by
calling methods on the datetime of the same name:

```
dt.date()  # datetime.date(2011, 10, 29)
dt.time()  # datetime.time(20, 30, 21)
```

The strftime method formats a datetime as a string:

```
dt.strftime('%m/%d/%Y %H:%M')
```

Strings can be converted (parsed) into datetime objects with the strptime function:

```
datetime.strptime('20091031', '%Y%m%d')
```

When you are aggregating or otherwise grouping time series data, it will occasionally
be useful to replace time fields of a series of datetimes—for example, replacing the
minute and second fields with zero:

```
dt.replace(minute=0, second=0)
```

The difference of two datetime objects produces a datetime.timedelta type:

```
dt2 = datetime(2011, 11, 15, 22, 30)
delta = dt2 - dt
delta
type(delta)
```

Adding a timedelta to a datetime produces a new shifted datetime:

```
dt
dt + delta
```

### Exercise:
- Print out the weekday of today.
- Print out the day of the year for today's date (for example, January 1 is day 1).

### Control Flow
Python has several built-in keywords for conditional logic, loops, and other standard
control flow concepts found in other programming languages.

#### if, elif, and else

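The if statement checks a condition and, if it is True, runs the code in the block that follows:

```
x = -2
if x < 0:
    print("It's negative")
```

Output:

```
It's negative
```

An if statement can optionally be followed by one or more elif blocks and a catch-all else block that runs if all of the conditions are False: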


```
if x < 0:
    print("It's negative")
elif x == 0:
    print('Equal to zero')
elif 0 < x < 5:
    print('Positive but smaller than 5')
else:
    print('Positive and larger than or equal to 5')
```

If any of the conditions is True, no further elif or else blocks will be reached. With
a compound condition using and or or, conditions are evaluated left to right and will
short-circuit:
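A minimal sketch of short-circuiting, assuming a hypothetical helper `check` that records each time it is evaluated:

```python
calls = []  # records which checks were actually evaluated


def check(label, value):
    """Hypothetical helper: note that it ran, then return the given value."""
    calls.append(label)
    return value


a, b, c, d = 5, 7, 8, 4

# a < b is True, so `or` never evaluates the second check.
if check("a < b", a < b) or check("c > d", c > d):
    print("Made it")

# a > b is False, so `and` never evaluates the second check.
if check("a > b", a > b) and check("c > d again", c > d):
    print("Not reached")

print(calls)  # ['a < b', 'a > b']
```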

It is also possible to chain comparisons:

```
4 > 3 > 2 > 1
```

#### for loops
for loops are for iterating over a collection (like a list or tuple) or an iterator. The
standard syntax for a for loop is:

```
for value in collection:
    # do something with value
```

You can advance a for loop to the next iteration, skipping the remainder of the block,
using the continue keyword. Consider this code, which sums up integers in a list and
skips None values:
```
sequence = [1, 2, None, 4, None, 5]
total = 0
for value in sequence:
    if value is None:
        continue
    total += value
```

A for loop can be exited altogether with the break keyword. This code sums elements
of the list until a 5 is reached:
```
sequence = [1, 2, 0, 4, 6, 5, 2, 1]
total_until_5 = 0
for value in sequence:
    if value == 5:
        break
    total_until_5 += value
```

As we will see in more detail, if the elements in the collection or iterator are sequences
(tuples or lists, say), they can be conveniently unpacked into variables in the for
loop statement:
```
for a, b, c in iterator:
    # do something
```
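As a concrete sketch of unpacking in a for loop, using a made-up list of three-element tuples:

```python
points = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

row_sums = []
for a, b, c in points:
    # Each tuple is unpacked into a, b, and c on every iteration.
    row_sums.append(a + b + c)

print(row_sums)  # [6, 15, 24]

# The same idea works for key/value pairs from a dictionary.
scores = {"alice": 90, "bob": 85}
for name, score in scores.items():
    print(name, score)
```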

#### while loops
A while loop specifies a condition and a block of code that is to be executed until the
condition evaluates to False or the loop is explicitly ended with break:

```
x = 256
total = 0
while x > 0:
    if total > 500:
        break
    total += x
    x = x // 2
```

#### pass
pass is the “no-op” statement in Python. It can be used in blocks where no action is to
be taken (or as a placeholder for code not yet implemented); it is only required
because Python uses whitespace to delimit blocks:

```
if x < 0:
    print('negative!')
elif x == 0:
    # TODO: put something smart here
    pass
else:
    print('positive!')
```

#### range
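The range function returns a range object, an iterable that lazily produces a sequence of evenly spaced
integers:

```
range(10)
list(range(10))
```

A start, end, and step (which may be negative) can also be given. The range stops before the endpoint:

```
list(range(0, 20, 2))
list(range(5, 0, -1))
```

A common use of range is iterating through a sequence by index:

```
seq = [1, 2, 3, 4]
for i in range(len(seq)):
    val = seq[i]
```

This snippet sums all numbers from 0 to 99,999 that are multiples of 3 or 5:

```
total = 0  # avoid the name sum, which would shadow the built-in function
for i in range(100000):
    # % is the modulo operator
    if i % 3 == 0 or i % 5 == 0:
        total += i
```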

### Exercise:
- Write a for loop that prints the numbers from 0 to 20, indicating whether each number is even or odd.

## Exceptions in Python

```
fileName = 'empty_lines.txt'
try:
    with open(fileName, 'r') as f:
        lines = f.readlines()
        print(lines)
except FileNotFoundError:
    print("{} doesn't exist".format(fileName))
```

Output:

```
['this is a Line\n', '\n', 'this is a Line\n', '\n', '\n', '\n', 'this is a Line\n']
```


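Catching the broader OSError also handles a missing file, because FileNotFoundError is a subclass of OSError:

```
filename = 'empty_lines'
try:
    with open(filename, 'r') as f:
        lines = f.readlines()
        print(lines)
except OSError:
    print("{} doesn't exist".format(filename))
```

Output:

```
empty_lines doesn't exist
```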
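The same try...except pattern applies beyond files. As a minimal sketch, the function below (the name `attempt_float` is illustrative) tries to convert a value to a float and falls back to returning the input unchanged when the conversion raises an error:

```python
def attempt_float(x):
    """Return x converted to float, or x unchanged if conversion fails."""
    try:
        return float(x)
    except (TypeError, ValueError):
        # ValueError: a string that is not a number, such as "something".
        # TypeError: a type float() cannot handle, such as a tuple.
        return x


print(attempt_float("1.2345"))     # 1.2345
print(attempt_float("something"))  # something
print(attempt_float((1, 2)))       # (1, 2)
```

Catching only the exception types you expect keeps unrelated bugs from being silently hidden.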


# References

Python for Data Analysis by Wes McKinney. Publisher: O'Reilly Media.