# Python Basics for Data Analysis (30 mins)
Python is an interpreted, object-oriented language: code runs without a separate compilation step, and every value you work with is an object with attributes and methods. Its readable syntax makes it a good choice for beginners and for data analysis.


In [1]:
# Python code can be executed directly without compilation
print("Hello, World!")

Hello, World!


## Variables, Data Types, and Basic Operations

### Variables and Assignment
Variables are names that refer to values. Assigning with `=` binds a name to a value.

In [2]:
# Assign values to variables
x = 10
name = "Alice"
is_active = True

# Print variables
print(x, name, is_active)

10 Alice True


In [None]:
# Object-oriented: All variables in Python are objects

x = 42
print(x.real)  # Accessing attributes of the int object
print(x + 10)  # Using methods/operators on an object
print(x.__le__(50))  # Calling the special method behind x <= 50

y = "Hello"
print(y.upper())  # Accessing string methods


## Data Types
Python has various data types such as integers, floats, strings, booleans, and collections like lists and dictionaries.

In [7]:
# Integer and Float
age = 25
height = 5.9

# String
greeting = "Hello, Python!"

# Boolean
is_valid = False

# List
numbers = [1, 2, 3, 4, 5]

# Dictionary
person = {"name": "Alice", "age": 25}

print(age, height, greeting, is_valid, numbers, person)

25 5.9 Hello, Python! False [1, 2, 3, 4, 5] {'name': 'Alice', 'age': 25}


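Python is dynamically typed: you never declare a variable's type, and the same name can later be bound to a value of a different type. It is also strongly typed, so only some types mix implicitly (for example `int` and `float` in arithmetic); others, such as `str` and `int`, need an explicit conversion like `str()`.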

In [8]:
# Variables with different types
age = 25
height = 5.9
greeting = "Hello, Python!"
is_valid = False
numbers = [1, 2, 3, 4, 5]
person = {"name": "Alice", "age": 25}

# Operations across types: int/float mix implicitly, str needs explicit conversion
print(age + height)  # Adds int and float: 25 + 5.9 -> 30.9
print(greeting + " Age is " + str(age))  # int must be converted with str() before concatenation
print(numbers * 2)  # Duplicates the list: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(person["name"] + " is " + str(person["age"]) + " years old.")  # String manipulation
print(bool(age) and is_valid)  # Combines int and bool in a logical operation: False

# Display all variables
print(age, height, greeting, is_valid, numbers, person)


30.9
Hello, Python! Age is 25
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Alice is 25 years old.
False
25 5.9 Hello, Python! False [1, 2, 3, 4, 5] {'name': 'Alice', 'age': 25}
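
The cell above wraps `age` in `str()` before concatenating it with a string. A minimal sketch of why that explicit conversion is needed: Python promotes `int` to `float` automatically in arithmetic, but it refuses to mix `str` and `int` and raises a `TypeError` instead. The variable names are reused from the cell above purely for illustration.

```python
age = 25
height = 5.9

# int + float: Python promotes the int to a float automatically
total = age + height
print(type(total).__name__)  # float

# str + int: no automatic conversion, so this raises TypeError
try:
    message = "Age is " + age
except TypeError as exc:
    message = None
    print("TypeError:", exc)

# Explicit conversion (or an f-string) makes the intent clear
explicit = "Age is " + str(age)
formatted = f"Age is {age}"
print(explicit, "|", formatted)
```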


## Loops and conditional statements

In [None]:
### If-Else

# Conditional statements
x = 10
if x > 5:
    print("x is greater than 5")
else:
    print("x is 5 or less")


### For Loop

# Looping through a list
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)


### While Loop

# Loop until a condition is met
count = 0
while count < 5:
    print("Count is", count)
    count += 1


Some useful idioms for loops and built-in data structures in Python.

### zip: Merge two lists into a dictionary.

In [None]:
keys = ["name", "age", "city"]
values = ["Alice", 25, "New York"]

result = dict(zip(keys, values))
print(result)  # {'name': 'Alice', 'age': 25, 'city': 'New York'}

### List comprehension: Elegant and concise iteration.

In [None]:
squares = [x ** 2 for x in range(1, 6)]
print(squares)

### Tokenization Example Using Enumerate

In [None]:
# Tokenizing the string into words
text = "Python is a great programming language."
tokens = text.split()  # ['Python', 'is', 'a', 'great', 'programming', 'language.']

# Assigning indices to tokens using enumerate
tokenized = {index: token for index, token in enumerate(tokens)}

print(tokenized)

### Set: Removing duplicates from a list

In [None]:
items = [1, 2, 3, 1, 2, 4, 5, 4, 6]

# Use set to create a unique list (the original order is not guaranteed)
unique_items = list(set(items))

print("Original List:", items)
print("Unique List:", unique_items)
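
A `set` does not promise to keep the original order of the items. When order matters, one common alternative (a sketch, not the only approach) is `dict.fromkeys`, which keeps the first occurrence of each item because dictionaries preserve insertion order in Python 3.7+. The sample list is made up.

```python
items = [3, 1, 3, 2, 1, 5, 2]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(items))

print("Original List:", items)
print("Unique, order kept:", unique_in_order)  # [3, 1, 2, 5]
```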

### for with else: Search for a specific file in a directory.

In [None]:
import os
directory = "/path/to/directory"
for file in os.listdir(directory):
    if file.endswith(".txt"):
        print(f"Found a text file: {file}")
        break
else:
    print("No text files found.")
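
The `else` branch of a `for` loop runs only when the loop finishes without hitting `break`. The sketch below shows both outcomes on in-memory lists of made-up file names, so it runs without a real directory; `first_text_file` is a hypothetical helper name.

```python
def first_text_file(file_names):
    """Return the first name ending in .txt, or None if there is none."""
    for name in file_names:
        if name.endswith(".txt"):
            found = name
            break  # leaving via break skips the else branch
    else:
        # Reached only when the loop ran to completion without break
        found = None
    return found

print(first_text_file(["a.csv", "notes.txt", "b.txt"]))  # notes.txt
print(first_text_file(["a.csv", "b.json"]))              # None
```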

### Use of Range with For

In [None]:
# Months of the year
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Use range to iterate through months
for i in range(len(months)):
    print(f"Generating report for {months[i]}...")

## File and Directory Access

### Reading and Writing Files

In [None]:
# Writing to a file
with open('example.txt', 'w') as file:
    file.write("Hello, this is a test file.")

# Reading from a file
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)


### Listing Directory Contents

In [None]:
import os

# List files and directories in the current directory
print(os.listdir('.'))

### Creating and Removing Directories

In [None]:
# Create a directory
os.mkdir('test_directory')

# Remove a directory
os.rmdir('test_directory')

### Visualizing Directory Structure with Tree Command
To view the directory structure in your terminal, you can use the `tree` command (install it first if it is not available).
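
If `tree` is not installed, a rough equivalent can be sketched in Python with `os.walk`. The function name `print_tree` and the indentation style are illustrative choices, not the output format of the real `tree` command. The demo builds a throwaway directory so the output is predictable.

```python
import os
import tempfile

def print_tree(root):
    """Print directories and files under root, indented by depth."""
    root = os.path.abspath(root)
    lines = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # sorting in place gives a stable traversal order
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        lines.append("    " * depth + os.path.basename(current) + "/")
        for name in sorted(files):
            lines.append("    " * (depth + 1) + name)
    print("\n".join(lines))
    return lines

# Demo on a temporary directory containing one file and one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "data"))
    with open(os.path.join(tmp, "data", "example.txt"), "w") as f:
        f.write("hello")
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("hi")
    tree_lines = print_tree(tmp)
```

Calling `print_tree('.')` lists the current working directory instead.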

# Essential libraries

# NumPy

### NumPy Arrays
NumPy arrays are more efficient than Python lists for numerical operations. You can create arrays using np.array().


In [None]:
import numpy as np
# Creating a NumPy array
array = np.array([1, 2, 3, 4, 5])
print("Array:", array)
print("Type:", type(array))

# 1D Array
arr_1d = np.array([1, 2, 3])
print("1D Array:", arr_1d)

# 2D Array
arr_2d = np.array([[1, 2], [3, 4]])
print("2D Array:\n", arr_2d)

# 3D Array
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("3D Array:\n", arr_3d)


# Arithmetic with numpy arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Arithmetic operations
print("Addition:", a + b)
print("Multiplication:", a * b)
print("Scalar Multiplication:", a * 2)


### Indexing and Slicing

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Access specific element
print("Element at (0,1):", arr[0, 1])

# Slice rows and columns
print("First row:", arr[0, :])
print("First column:", arr[:, 0])

# Modify elements
arr[1, 1] = 99
print("Modified Array:\n", arr)


In [None]:
# Transposing arrays
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original Array:\n", arr)
print("Transposed Array:\n", arr.T)

### Some musings on speed of numpy

NumPy is significantly faster than Python lists combined with explicit loops. The main benefit comes from vectorization: an element-wise operation is written as a single expression, and the loop runs inside NumPy's compiled C code over a contiguous block of same-typed values rather than in the Python interpreter. For linear algebra such as dot products, NumPy also delegates to optimized BLAS/LAPACK libraries, which may use multiple threads depending on how NumPy was built.

In [15]:
# comparing array speeds of numpy versus regular python operations

import numpy as np
import time

# Generate data
size = 1_000_000
list1 = list(range(size))
list2 = list(range(size))

array1 = np.array(list1)
array2 = np.array(list2)

# Using Python lists
start_time = time.time()
result_list = [x + y for x, y in zip(list1, list2)]
end_time = time.time()
print(f"Python list addition took: {end_time - start_time:.5f} seconds")

# Using NumPy arrays
start_time = time.time()
result_array = array1 + array2
end_time = time.time()
print(f"NumPy array addition took: {end_time - start_time:.5f} seconds")


Python list addition took: 0.02499 seconds
NumPy array addition took: 0.00071 seconds


In [17]:
speed_upgrade = 0.02499/0.00071
print(speed_upgrade)

35.197183098591545
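
Single `time.time()` measurements are noisy, so the exact ratio will vary between runs and machines. A sketch of a steadier comparison using the standard library's `timeit.repeat` and keeping the best of several runs; the array size and repeat counts here are arbitrary choices.

```python
import timeit
import numpy as np

size = 100_000
list1 = list(range(size))
list2 = list(range(size))
array1 = np.array(list1)
array2 = np.array(list2)

def add_lists():
    return [x + y for x, y in zip(list1, list2)]

def add_arrays():
    return array1 + array2

# Each measurement runs the function 10 times; take the best of 5 measurements
list_time = min(timeit.repeat(add_lists, number=10, repeat=5)) / 10
numpy_time = min(timeit.repeat(add_arrays, number=10, repeat=5)) / 10

print(f"Python list addition: {list_time:.6f} s per run")
print(f"NumPy array addition: {numpy_time:.6f} s per run")
print(f"Speed-up: {list_time / numpy_time:.1f}x")
```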


In [None]:

# A practical use of NumPy is calculating similarity between vectors. We'll cover this in more depth later on.
# It is a fast way to compute similarity for large 'embedding' vectors generated from text,
# such as word embeddings, sentence embeddings, or document embeddings in natural language processing (NLP).
# NumPy's vectorized operations, like the dot product and norms, make comparing embeddings efficient.

In [16]:
import numpy as np
import time
import math

# Generate large random arrays
size = 1_000_000
vec1 = np.random.rand(size)
vec2 = np.random.rand(size)

# Using Python lists for cosine similarity
vec1_list = vec1.tolist()
vec2_list = vec2.tolist()

def cosine_similarity_lists(v1, v2):
    dot_product = sum(x * y for x, y in zip(v1, v2))
    norm_v1 = math.sqrt(sum(x ** 2 for x in v1))
    norm_v2 = math.sqrt(sum(y ** 2 for y in v2))
    return dot_product / (norm_v1 * norm_v2)

# Using NumPy for cosine similarity
def cosine_similarity_numpy(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Measure time for Python lists
start_time = time.time()
cos_sim_list = cosine_similarity_lists(vec1_list, vec2_list)
end_time = time.time()
print(f"Cosine Similarity (Lists): {cos_sim_list:.5f}, Time: {end_time - start_time:.5f} seconds")

# Measure time for NumPy
start_time = time.time()
cos_sim_numpy = cosine_similarity_numpy(vec1, vec2)
end_time = time.time()
print(f"Cosine Similarity (NumPy): {cos_sim_numpy:.5f}, Time: {end_time - start_time:.5f} seconds")


Cosine Similarity (Lists): 0.75007, Time: 0.10440 seconds
Cosine Similarity (NumPy): 0.75007, Time: 0.00126 seconds
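
Both implementations compute cos(θ) = (a · b) / (‖a‖ ‖b‖). A quick sanity check on small vectors where the answer is known by hand: parallel vectors give 1 and perpendicular vectors give 0. The value near 0.75 above is also what one would expect for independent uniform values on [0, 1), since the mean of x·y is 1/4 and the mean of x² is 1/3, and (1/4) / (1/3) = 0.75. The function is redefined here so the sketch runs on its own.

```python
import numpy as np

def cosine_similarity_numpy(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Vectors pointing in the same direction
parallel = cosine_similarity_numpy(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
# Vectors at a right angle
perpendicular = cosine_similarity_numpy(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(f"Parallel vectors:      {parallel:.5f}")       # 1.00000
print(f"Perpendicular vectors: {perpendicular:.5f}")  # 0.00000
```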


In [18]:
speed_upgrade = 0.10440/0.00126
print(speed_upgrade)

82.85714285714286


In [1]:
import numpy as np

# Weekly sales data (rows: weeks, columns: categories)
sales_data = np.array([
    [150, 200, 250, 300, 350],
    [160, 210, 240, 310, 360],
    [155, 220, 260, 290, 370],
    [165, 230, 270, 280, 340],
    [170, 190, 220, 310, 330],
    [180, 195, 245, 320, 310],
    [175, 185, 235, 305, 325],
    [190, 215, 250, 315, 345],
])

### Exercise for numpy

Shape and Data Overview: Print the shape of the array and the total sales for the store (sum of all elements).

Slicing Data: Extract the sales data for the last 4 weeks and first 3 product categories.

Category-wise Totals: Calculate the total sales for each product category across all weeks (hint: sum along the correct axis).

Find the Week with the Highest Sales: Identify the week (row index) with the highest total sales.

Filter by Threshold: Print the weeks where sales of product category 3 (index 2) exceeded 250.

Normalize Sales Data: Normalize the sales data by dividing each element by the maximum sales recorded in any category.

Hints
Hint 1: sales_data.shape gives the dimensions of the array.
Hint 2: Use sales_data[-4:, :3] to slice the last 4 rows and first 3 columns.
Hint 3: Use np.sum along axis for category-wise totals.
Hint 4: Use np.argmax and np.sum to find the index of the row with the highest sales.
Hint 5: Use Boolean indexing
Hint 6: Use sales_data / np.max(sales_data) for normalization.
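
For Hint 3, the `axis` argument is easy to mix up. A small sketch on a made-up 2×3 array (not the exercise data): `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row.

```python
import numpy as np

toy = np.array([
    [1, 2, 3],
    [10, 20, 30],
])

col_totals = toy.sum(axis=0)  # one total per column
row_totals = toy.sum(axis=1)  # one total per row
best_row = np.argmax(row_totals)  # index of the row with the largest total

print("Shape:", toy.shape)           # (2, 3)
print("Column totals:", col_totals)  # [11 22 33]
print("Row totals:", row_totals)     # [ 6 60]
print("Row with largest total:", best_row)  # 1
```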


# Pandas
Pandas is the big daddy of data manipulation, offering powerful tools for cleaning, transforming, and analyzing large datasets with ease. Its intuitive DataFrame and Series structures simplify handling tabular data, while built-in functions enable efficient filtering, grouping, and aggregation.

The fundamental data structures of Pandas are the Series and the DataFrame.

Series:

A one-dimensional labeled array capable of holding any data type (integer, string, float, etc.).

Similar to a column in a spreadsheet or database table.

DataFrame:

A two-dimensional labeled data structure with columns of potentially different types.

Think of it as a spreadsheet or SQL table.

In [None]:
# Series
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, name='Numbers')
print("Series")
print(series)
print("********")

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, 90, 78]}
df = pd.DataFrame(data)
print(df)

First, let's create some sample data to work with. We'll generate sales data
that includes dates, products, regions, and various metrics.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate sample sales data
np.random.seed(42)
n_rows = 300

# Sales Data
dates = pd.date_range('2023-01-01', periods=n_rows)
products = ['Laptop', 'Phone', 'Tablet', 'Watch']
regions = ['North', 'South', 'East', 'West']

sales_data = {
    'Date': dates,
    'Product': np.random.choice(products, n_rows),
    'Region': np.random.choice(regions, n_rows),
    'Units': np.random.randint(1, 50, n_rows),
    'Price': np.random.uniform(200, 2000, n_rows).round(2),
    'Customer_Rating': np.random.uniform(3, 5, n_rows).round(1)
}


In [None]:
df_sales = pd.DataFrame(sales_data)
#df_sales.head(5)
# Create a new column (Series) as a function of other columns
df_sales['Total_Sales'] = df_sales['Units'] * df_sales['Price']
df_sales.head(5)

In [18]:
# Generate customer data
customer_data = {
    'Region': regions,
    'Regional_Manager': ['John Smith', 'Emma Davis', 'Michael Chen', 'Sarah Wilson'],
    'Target_Revenue': np.random.uniform(100000, 500000, 4).round(2)
}
df_customers = pd.DataFrame(customer_data)

One of the most useful features of Pandas is how easily it loads data from various sources, including CSV, XLSX, and SQL; the Pandas documentation lists all the supported readers. Pandas can also handle files of several GB, as long as the data fits in memory, whereas an Excel worksheet is capped at 1,048,576 rows. Writing data back to CSV is just as easy.

In [20]:
df_sales.to_csv('sales_data.csv',index=False)
df_customers.to_csv('customer_data.csv',index=False)
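
Reading data back is the mirror image of `to_csv`. The sketch below uses an in-memory CSV via `io.StringIO` so it runs on its own; with a real file you would pass the file path instead. `parse_dates` and `chunksize` are standard `pd.read_csv` parameters, and the sample rows are made up.

```python
import io
import pandas as pd

csv_text = """Date,Product,Units,Price
2023-01-01,Laptop,2,1000.0
2023-01-02,Phone,5,500.0
2023-01-03,Tablet,1,300.0
"""

# Read everything at once, parsing the Date column as datetimes
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.dtypes)

# For files too large to load at once, read in chunks and aggregate as you go
total_units = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_units += int(chunk["Units"].sum())
print("Total units:", total_units)  # 8
```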

### Basic Data Exploration

In [None]:
print("\n=== Basic Data Exploration ===")
print("\nFirst few rows:")
print(df_sales.head())
print("\nDataset info:")
print(df_sales.info())
print("\nSummary statistics:")
print(df_sales.describe())

In [None]:
# Data Filtering
print("\n=== Data Filtering ===")
high_value_sales = df_sales[df_sales['Total_Sales'] > df_sales['Total_Sales'].mean()]
print("\nHigh value sales (above mean):")
print(high_value_sales.head())

# Filter multiple conditions
laptop_north = df_sales[(df_sales['Product'] == 'Laptop') & (df_sales['Region'] == 'North')]
print("\nLaptop sales in North region:")
print(laptop_north.head())

### Data Aggregation
One of the most common tasks in Pandas is grouping by some column. For example, to summarise sales by region, you group by the region. It is very similar to GROUP BY in SQL, for those who have a background in databases.

In [28]:
region_sales = df_sales.groupby('Region')['Total_Sales'].agg(['min','max','sum'])
print("\nSales by region:")
print(region_sales)



Sales by region:
            min       max         sum
Region                               
East     839.65  82615.72  2080746.82
North    638.44  75718.40  1838729.69
South   2714.01  83102.88  2043672.52
West     736.33  84279.51  2687353.85
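
To see what `groupby` does step by step, here is a sketch on a made-up four-row frame: rows are split by `Region`, the aggregation runs on each group, and the results are combined into one row per region. The named-aggregation form (for example `total=("Total_Sales", "sum")`) is an alternative way to label the output columns; the column labels `smallest`, `largest`, and `total` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Total_Sales": [100.0, 200.0, 300.0, 50.0],
})

# One output row per region, one output column per named aggregation
summary = toy.groupby("Region").agg(
    smallest=("Total_Sales", "min"),
    largest=("Total_Sales", "max"),
    total=("Total_Sales", "sum"),
)
print(summary)
```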


Pandas DataFrames and Series are built on top of NumPy arrays, which is a large part of why they are fast. Below we do some operations that combine DataFrames and NumPy arrays.

In [None]:
# Convert dataframe to numpy array
sales_array = region_sales[['min', 'max', 'sum']].to_numpy()  # a list of column names needs double brackets

# Calculate the mean across columns (one value per region)
print("Mean across columns")
print(np.mean(sales_array, axis=1))

# Find the max sum across all regions
print("Max sum across regions")
print(np.max(sales_array[:,2]))

# Normalized sum brings all sum values between 0 and 1
min_sum = region_sales['sum'].min()
max_sum = region_sales['sum'].max()
region_sales['normalized_sum'] = (region_sales['sum'] - min_sum)/(max_sum - min_sum)


In [None]:
product_performance = df_sales.groupby('Product').agg({
    'Units': 'sum',
    'Total_Sales': 'sum',
    'Customer_Rating': 'mean'
}).round(2)
print("\nProduct performance:")
print(product_performance)

In [None]:
# Time Series Analysis
print("\n=== Time Series Analysis ===")
# 'M' means month-end frequency; pandas 2.2+ deprecates it in favour of 'ME'
monthly_sales = df_sales.set_index('Date').resample('M')['Total_Sales'].sum()
print("\nMonthly sales:")
print(monthly_sales)


A very useful feature of Pandas is 'merge' which is similar to 'JOIN' in SQL. Multiple files with a single common attribute can be merged in this way

In [None]:
print("\n=== Data Merging ===")
merged_df = df_sales.merge(df_customers, on='Region', how='left')
print("\nMerged data:")
print(merged_df.head())
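
The `how` argument controls which rows survive the merge. A sketch on made-up frames where one region has no manager: `how='left'` keeps every sales row and fills the missing manager with `NaN`, while `how='inner'` drops the unmatched row. `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

sales = pd.DataFrame({"Region": ["North", "South", "Central"], "Units": [10, 20, 30]})
managers = pd.DataFrame({"Region": ["North", "South"], "Manager": ["John", "Emma"]})

# Left join: all rows from sales, manager filled in where a match exists
left = sales.merge(managers, on="Region", how="left", indicator=True)
# Inner join: only regions present in both frames
inner = sales.merge(managers, on="Region", how="inner")

print(left)
print(inner)
```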

### Functions and Lambda Functions
I have held off introducing functions until now, even though they are a cornerstone of any language. For data analysis, named functions and anonymous functions (lambdas, as we call them) are equally important.

In [35]:
df_sales.columns

Index(['Date', 'Product', 'Region', 'Units', 'Price', 'Customer_Rating',
       'Total_Sales'],
      dtype='object')

In [49]:
# Functions are defined with the 'def' keyword

def categorize_sales(sales: float) -> str:  # Typically takes an argument, transforms it, and returns an output
    if sales > 25000:
        return 'High'
    elif sales >= 15000:  # covers 15000 up to and including 25000
        return 'Med'
    else:
        return 'Low'

df_sales['sales_category'] = df_sales['Total_Sales'].apply(categorize_sales)
df_sales.head(5)

        Date Product Region  Units    Price  Customer_Rating  Total_Sales sales_category
0 2023-01-01  Tablet  North     45   940.17              3.7     42307.65           High
1 2023-01-02   Watch  North     32  1285.01              3.1     41120.32           High
2 2023-01-03  Laptop   East     30   687.72              3.2     20631.60            Med
3 2023-01-04  Tablet  South     47   439.73              3.2     20667.31            Med
4 2023-01-05  Tablet  South     35   337.16              4.7     11800.60            Low


In [None]:
# Another important notation is the lambda notation where you write functions inline to transform the data
# x is the input parameter or argument

df_sales = df_sales.drop(columns='sales_category')  # drop returns a new DataFrame, so reassign it

df_sales['sales_category'] = df_sales['Total_Sales'].apply(
    lambda x: 'High' if x>25000 else 'Med' if x >=15000 else 'Low')
df_sales.head(5)
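
`apply` calls the Python function once per value. For large frames, the same thresholds can be expressed as vectorized conditions. A sketch using `np.select`, which picks the first matching condition for each element; the sample values are made up.

```python
import numpy as np
import pandas as pd

totals = pd.Series([42307.65, 20631.60, 11800.60, 25000.00], name="Total_Sales")

# Conditions are checked in order; the first True one wins
conditions = [totals > 25000, totals >= 15000]
labels = ["High", "Med"]
categories = pd.Series(np.select(conditions, labels, default="Low"), index=totals.index)

# Row-by-row lambda with the same thresholds, for comparison
via_apply = totals.apply(lambda x: "High" if x > 25000 else "Med" if x >= 15000 else "Low")

print(categories.tolist())
print(categories.tolist() == via_apply.tolist())
```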

In [57]:
print(df_sales.columns)

Index(['Date', 'Product', 'Region', 'Units', 'Price', 'Customer_Rating',
       'Total_Sales', 'sales_category'],
      dtype='object')


### Combining some concepts - Groupby, Filter, lambda etc

In [None]:
# Among high sales_category entries, which region is giving the maximum sales

# first filter the dataframe (use a new name so df_sales keeps all rows for later cells)
high_sales = df_sales[df_sales['sales_category'] == 'High']
# then apply groupby and sum
max_sum_per_region = high_sales.groupby('Region')['Total_Sales'].sum()
print(max_sum_per_region)  # groupby + sum returns a Series indexed by Region (since we grouped by region)
max_sum_region = max_sum_per_region.idxmax()
print(max_sum_region)

In [None]:
# Calculate average price per product
avg_price_per_product = df_sales.groupby('Product')['Price'].mean()
print(avg_price_per_product) # This gives a series where Product is the index and mean price is the value


In [63]:
# Generate profit margin using Lambda
# Add a cost column
df_sales['Cost'] = df_sales['Price']*np.random.uniform(0.6,0.8,size=len(df_sales))
# Add Profit margin column using lambda
df_sales['Profit_Margin'] = df_sales.apply(lambda row: ((row['Price']-row['Cost'])/row['Price'])*100,axis=1)
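
Row-wise `apply` with `axis=1` is convenient but slow on large frames, because it runs a Python function for every row. The same profit margin can be computed with column arithmetic. A sketch on made-up prices and costs, checking that both approaches agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [100.0, 250.0, 80.0], "Cost": [70.0, 150.0, 60.0]})

# Row-by-row version
margin_apply = df.apply(lambda row: (row["Price"] - row["Cost"]) / row["Price"] * 100, axis=1)

# Vectorized version: operates on whole columns at once
margin_vectorized = (df["Price"] - df["Cost"]) / df["Price"] * 100

print(margin_vectorized.round(2).tolist())           # [30.0, 40.0, 25.0]
print(np.allclose(margin_apply, margin_vectorized))  # True
```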

In summary, we have covered essential Python concepts, starting from basic data types and operations and progressing through list comprehensions, file handling, and directory management. The NumPy section demonstrated its power for numerical computation, although in day-to-day data science you will typically use NumPy directly less often than Pandas. For Pandas, we have laid the groundwork for data manipulation and analysis, emphasizing its fundamental structures (Series and DataFrame) and the different functions that can be applied to tabular data.