In [None]:
!pip install --upgrade pandas

In [2]:
import pandas as pd
from google.colab import files
uploaded = files.upload()

Saving session_03_data_practice.csv to session_03_data_practice.csv


### 1. Reading Data from CSV with Specific Features
When reading CSV files, we often need to handle special cases:
- Different delimiters (here we use ';')
- Potential encoding issues
- Handling of missing values
- Parsing dates correctly

In [7]:
df = pd.read_csv('session_03_data_practice.csv', delimiter=';', parse_dates=['Creation Date', 'Date when pay'])

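**Note: The `parse_dates` parameter attempts to convert the specified columns to datetime objects.**  
This is especially useful for time series analysis. If a column cannot be parsed, `read_csv` returns it unaltered as `object`, so always confirm the resulting dtypes with `df.info()`.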
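The dates in this file are written day-first (`01.12.2024 10:50`), while pandas assumes month-first unless told otherwise. The sketch below is one possible way to handle that at read time; it uses a small inline sample (hypothetical rows that only mimic the file's layout) rather than the real CSV, and `sample_df` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical two-row sample: semicolon-delimited, day-first dates, one missing payment date
sample = io.StringIO(
    "Number;Creation Date;Date when pay;Status\n"
    "1;01.12.2024 10:50;01.12.2024 10:52;Completed\n"
    "2;13.12.2024 21:43;;Canceled\n"
)

sample_df = pd.read_csv(
    sample,
    sep=";",                                        # same delimiter as the practice file
    parse_dates=["Creation Date", "Date when pay"],
    dayfirst=True,                                  # read 01.12.2024 as 1 December
)

# Confirm that both date columns are datetime rather than object
print(sample_df.dtypes)
```

If the dtypes still come back as `object`, converting afterwards with `pd.to_datetime(..., dayfirst=True)` is a reasonable fallback.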

### 2. Initial Data Exploration
Understanding your data is the first critical step in any analysis.

**View first 5 rows to get a quick look at the data structure**

In [8]:
df.head()

Unnamed: 0,Number,Creation Date,Date when pay,Title,Status,Money amount,City,Payment System
0,1062823,01.12.2024 10:50,01.12.2024 10:52,AI Engineering,Completed,29597.5,Bucha,Банківський переказ
1,1062855,01.12.2024 20:53,01.12.2024 21:27,AI Engineering,Completed,17450.3,Kharkiv,Google Pay
2,1062856,01.12.2024 21:43,,Python. Data analysis,Canceled,0.0,Vinnytsia,
3,1062880,03.12.2024 0:18,,Frontend Development,Canceled,0.0,Pervomaisk,
4,1062899,03.12.2024 21:43,,AI Engineering,Canceled,0.0,Smila,


**View last 5 rows to check if data is consistent throughout**

In [9]:
df.tail()

Unnamed: 0,Number,Creation Date,Date when pay,Title,Status,Money amount,City,Payment System
287,1064720,30.12.2024 9:42,30.12.2024 12:49,Java Developer,Completed,2935.44,Irpin,Apple Pay
288,1064724,30.12.2024 11:32,,Frontend Development,Canceled,0.0,rivne,
289,1064775,31.12.2024 2:17,31.12.2024 2:22,Frontend Development,Completed,7423.92,Kherson,Apple Pay
290,1064793,31.12.2024 16:40,01.01.2020 14:29,Java Developer,Completed,2935.44,Irpin,ПриватБанк
291,1064796,31.12.2024 17:29,31.12.2024 17:32,Python. Web-developer,Completed,9898.56,Sumy,Apple Pay


**Get basic information about the DataFrame:**
- Number of non-null entries per column
- Data types of each column
- Memory usage

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Number          292 non-null    int64  
 1   Creation Date   292 non-null    object 
 2   Date when pay   180 non-null    object 
 3   Title           292 non-null    object 
 4   Status          291 non-null    object 
 5   Money amount    292 non-null    float64
 6   City            266 non-null    object 
 7   Payment System  180 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 18.4+ KB


**Generate descriptive statistics for numeric columns:**
- Count, mean, std, min, quartiles, max

In [11]:
df.describe()

Unnamed: 0,Number,Money amount
count,292.0,292.0
mean,1063745.0,3397.615034
std,443.8688,5771.572829
min,1062823.0,0.0
25%,1063608.0,0.0
50%,1063698.0,2935.44
75%,1063807.0,2935.44
max,1064796.0,42750.0


**Get DataFrame dimensions (rows, columns)**

In [12]:
df.shape

(292, 8)

**View column names (important for referencing columns correctly)**

In [13]:
df.columns

Index(['Number ', 'Creation Date', 'Date when pay', 'Title', 'Status  ',
       'Money amount  ', 'City', 'Payment System'],
      dtype='object')

In [14]:
df.columns.tolist()

['Number ',
 'Creation Date',
 'Date when pay',
 'Title',
 'Status  ',
 'Money amount  ',
 'City',
 'Payment System']

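**Initial observations we might make:**
- Mixed data types (numeric, text, dates)
- The date columns ('Creation Date', 'Date when pay') were read as `object`, not datetime
- Some columns have missing values (like 'Date when pay')
- Some column names carry trailing spaces ('Number ', 'Status  ', 'Money amount  ')
- Potential data quality issues (spaces in city names, inconsistent capitalization)
- Numeric columns like 'Money amount' have wide ranges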

### 3. Renaming Columns
**Column renaming is important for:**
- Consistency (standard naming conventions)
- Readability (clear, descriptive names)
- Ease of use (avoid spaces/special characters in names)

In [20]:
df = df.rename(columns={
    'Number ': 'transaction_id',
    'Creation Date': 'creation_date',
    'Date when pay': 'payment_date',
    'Title': 'course_title',
    'Status  ': 'status',
    'Money amount  ': 'amount',
    'City': 'city',
    'Payment System': 'payment_method'
})

**Verify the changes**

In [21]:
df.columns.to_list()

['transaction_id',
 'creation_date',
 'payment_date',
 'course_title',
 'status',
 'amount',
 'city',
 'payment_method']
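Renaming by hand gives full control over the final names. When the only goal is to remove stray whitespace and get snake_case names, the headers can also be normalized programmatically. This is a minimal sketch on a hypothetical empty frame (`raw`) with the same kind of padded headers; the resulting names differ from the hand-picked ones above.

```python
import pandas as pd

# Hypothetical frame with padded headers like those in the practice file
raw = pd.DataFrame(columns=["Number ", "Creation Date", "Status  ", "Money amount  "])

# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores
raw.columns = (
    raw.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(raw.columns.tolist())
# ['number', 'creation_date', 'status', 'money_amount']
```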

In [22]:
df.payment_date

Unnamed: 0,payment_date
0,01.12.2024 10:52
1,01.12.2024 21:27
2,
3,
4,
...,...
287,30.12.2024 12:49
288,
289,31.12.2024 2:22
290,01.01.2020 14:29


### 4. Working with Columns and Rows

**Selecting Columns**
- Single column (returns Series)

In [23]:
df.course_title

Unnamed: 0,course_title
0,AI Engineering
1,AI Engineering
2,Python. Data analysis
3,Frontend Development
4,AI Engineering
...,...
287,Java Developer
288,Frontend Development
289,Frontend Development
290,Java Developer


In [24]:
df['course_title']

Unnamed: 0,course_title
0,AI Engineering
1,AI Engineering
2,Python. Data analysis
3,Frontend Development
4,AI Engineering
...,...
287,Java Developer
288,Frontend Development
289,Frontend Development
290,Java Developer


**Multiple columns (returns DataFrame)**

In [25]:
new_df = df[['course_title', 'status']]

In [26]:
new_df

Unnamed: 0,course_title,status
0,AI Engineering,Completed
1,AI Engineering,Completed
2,Python. Data analysis,Canceled
3,Frontend Development,Canceled
4,AI Engineering,Canceled
...,...,...
287,Java Developer,Completed
288,Frontend Development,Canceled
289,Frontend Development,Completed
290,Java Developer,Completed


**Selecting Rows**
- Single row by integer position (`iloc`, returns Series)

In [28]:
new_df.iloc[10]

Unnamed: 0,10
course_title,Python. Data analysis
status,Canceled


- Slice by position with a step (`iloc`; here from position 20 down to, but not including, 10, taking every second row)

In [31]:
new_df.iloc[20:10:-2]

Unnamed: 0,course_title,status
20,Frontend Development,Canceled
18,Frontend Development,Canceled
16,AI Engineering,Completed
14,Frontend Development,Completed
12,Frontend Development,Completed
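`iloc` always counts positions, while `loc` looks up index labels. The two coincide on a fresh `RangeIndex` but diverge once rows are filtered out. A small sketch with hypothetical data (`courses` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame; after filtering, the remaining index labels are 1, 3, 4
courses = pd.DataFrame(
    {
        "course_title": ["A", "B", "C", "D", "E"],
        "status": ["Canceled", "Completed", "Canceled", "Completed", "Completed"],
    }
)
completed = courses[courses["status"] == "Completed"]

first_by_position = completed.iloc[0]  # first remaining row, whatever its label
row_by_label = completed.loc[3]        # row whose index label is 3

print(first_by_position["course_title"], row_by_label["course_title"])
# B D
```

Calling `completed.loc[0]` here would raise a `KeyError`, because label 0 was filtered out.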


**Filtering Data**
- Completed transactions only

In [32]:
new_df = df[df['status'] == 'Completed']

In [34]:
new_df.describe()

Unnamed: 0,transaction_id,amount
count,180.0,180.0
mean,1063750.0,5511.686611
std,424.7928,6514.332458
min,1062823.0,1.0
25%,1063610.0,2935.44
50%,1063684.0,2935.44
75%,1063794.0,4230.5775
max,1064796.0,42750.0


- High-value transactions (> 20,000)

In [35]:
new_df = df[df['amount'] > 20000]

In [36]:
new_df.describe()

Unnamed: 0,transaction_id,amount
count,8.0,8.0
mean,1063421.0,31290.6625
std,593.3475,4630.526585
min,1062823.0,29597.5
25%,1062998.0,29597.5
50%,1063292.0,29695.7
75%,1063556.0,29695.7
max,1064553.0,42750.0


- Multiple conditions (use & for AND, | for OR)

In [37]:
new_df = df[(df['status'] == 'Completed') & (df['amount'] > 20000)]

In [38]:
new_df.describe()

Unnamed: 0,transaction_id,amount
count,8.0,8.0
mean,1063421.0,31290.6625
std,593.3475,4630.526585
min,1062823.0,29597.5
25%,1062998.0,29597.5
50%,1063292.0,29695.7
75%,1063556.0,29695.7
max,1064553.0,42750.0
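The combined filter returns the same 8 rows as the amount filter alone, which suggests that every transaction above 20,000 in this dataset is also completed. For reference, here is a short sketch of OR and NOT conditions on hypothetical data (`orders` is an illustrative frame, not the practice file):

```python
import pandas as pd

# Hypothetical orders used only to illustrate boolean masks
orders = pd.DataFrame(
    {
        "status": ["Completed", "Canceled", "Completed", "Canceled"],
        "amount": [25000.0, 0.0, 500.0, 30000.0],
    }
)

# Each comparison needs its own parentheses because & and | bind tighter than == and >
either = orders[(orders["status"] == "Canceled") | (orders["amount"] > 20000)]

# ~ negates a boolean mask
not_completed = orders[~(orders["status"] == "Completed")]

print(len(either), len(not_completed))
# 3 2
```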


- Filter by substring match (`str.contains` is case-sensitive by default)

In [39]:
python_courses = df[df['course_title'].str.contains('Python')]

In [40]:
python_courses

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method
2,1062856,01.12.2024 21:43,,Python. Data analysis,Canceled,0.0,Vinnytsia,
10,1062927,04.12.2024 22:02,,Python. Data analysis,Canceled,0.0,,
33,1063109,11.12.2024 23:52,12.12.2024 0:02,Python. Web-developer,Completed,9898.56,Kostiantynivka,WayForPay
40,1063283,14.12.2024 21:55,14.12.2024 23:57,Python. Data analysis,Completed,19698.9,zAPORIZHZHIA,Монобанк
44,1063310,15.12.2024 21:44,,Python. Data analysis,Canceled,0.0,,
46,1063333,16.12.2024 11:11,17.12.2024 13:42,Python. Data analysis,Completed,29695.7,Lviv,Google Pay
50,1063391,17.12.2024 9:59,17.12.2024 10:01,Python. Data analysis,Completed,29695.7,Kherson,LiqPay
213,1063797,19.12.2024 22:25,20.12.2024 12:54,Python. Web-developer,Completed,9898.56,Ivano-Frankivsk,PayPal
224,1063888,21.12.2024 10:44,,Python. Web-developer,Canceled,0.0,oDesa,
233,1064044,22.12.2024 21:45,,Python. Data analysis,Canceled,0.0,Kyiv,


In [41]:
python_courses.shape

(27, 8)

In [42]:
python_courses.describe()

Unnamed: 0,transaction_id,amount
count,27.0,27.0
mean,1064067.0,9231.391111
std,597.9821,10467.163664
min,1062856.0,0.0
25%,1063594.0,0.0
50%,1064351.0,9595.0
75%,1064512.0,9898.56
max,1064796.0,29695.7
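`str.contains` accepts `case` and `na` arguments, which help when capitalization is inconsistent or the column has missing values. A minimal sketch on a hypothetical Series (`titles` is made up for illustration):

```python
import pandas as pd

# Hypothetical titles, including a lowercase variant and a missing value
titles = pd.Series(["Python. Data analysis", "python basics", None, "Java Developer"])

# case=False ignores capitalization; na=False treats missing titles as non-matches
mask = titles.str.contains("python", case=False, na=False)

print(mask.tolist())
# [True, True, False, False]
```

Without `na=False`, a missing title produces a missing value in the mask, which cannot be used directly for boolean indexing.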


### 5. Basic Data Operations

**Adding Columns**
- Calculate payment delay in days (for completed transactions)

In [44]:
df['creation_date'] = pd.to_datetime(df['creation_date'], dayfirst=True)
df['payment_date'] = pd.to_datetime(df['payment_date'], dayfirst=True, errors='coerce')

# Calculate payment delay in whole days (.dt.days keeps only the whole-day component)
df['payment_delay'] = (df['payment_date'] - df['creation_date']).dt.days

paid_courses = df[df['status'] == 'Completed']
paid_courses.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,0.0
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,0.0
12,1062938,2024-12-05 12:07:00,2024-12-22 12:29:00,Frontend Development,Completed,8910.0,Lviv,PayPal,17.0
13,1062940,2024-12-05 15:35:00,2024-12-05 15:40:00,Machine Learning,Completed,9394.56,,LiqPay,0.0
14,1062947,2024-12-05 21:39:00,2024-12-07 13:35:00,Frontend Development,Completed,5044.05,Sumy,LiqPay,1.0
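Because `.dt.days` returns only the whole-day component, a payment made 1 day and 16 hours after creation shows up as `1`. If fractional days matter, `dt.total_seconds()` is one alternative. The sketch below reuses the timestamps of the row at index 14; `created`, `paid`, and `delay` are illustrative names.

```python
import pandas as pd

# Timestamps taken from the row at index 14 in the output above
fmt = "%d.%m.%Y %H:%M"
created = pd.to_datetime(pd.Series(["05.12.2024 21:39"]), format=fmt)
paid = pd.to_datetime(pd.Series(["07.12.2024 13:35"]), format=fmt)

delay = paid - created  # 1 day, 15 hours, 56 minutes

whole_days = delay.dt.days                          # whole-day component only
fractional_days = delay.dt.total_seconds() / 86400  # 86400 seconds per day

print(whole_days.iloc[0], round(fractional_days.iloc[0], 2))
# 1 1.66
```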


**Create a boolean column indicating high-value transactions**

In [45]:
df['is_high_value'] = df['amount'] > 20000
df.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay,is_high_value
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,0.0,True
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,0.0,False
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,,False
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,,False
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,,False


**Removing Columns**
- Drop the temporary column we created

In [46]:
df = df.drop('payment_delay', axis=1)

df.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,is_high_value
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,True
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,False
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,False
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,False
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,False


**Sorting Data**
- Sort by amount (descending)

In [47]:
df_sorted = df.sort_values('amount', ascending=False)
df_sorted.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,is_high_value
37,1063251,2024-12-14 14:34:00,2024-12-18 13:11:00,AI Engineering,Completed,42750.0,Kharkiv,WayForPay,True
46,1063333,2024-12-16 11:11:00,2024-12-17 13:42:00,Python. Data analysis,Completed,29695.7,Lviv,Google Pay,True
270,1064553,2024-12-27 18:06:00,2024-12-27 18:09:00,Python. Data analysis,Completed,29695.7,zAPORIZHZHIA,Google Pay,True
234,1064053,2024-12-23 00:38:00,2024-12-23 00:42:00,Python. Data analysis,Completed,29695.7,ZAPORIZHZHIA,Монобанк,True
50,1063391,2024-12-17 09:59:00,2024-12-17 10:01:00,Python. Data analysis,Completed,29695.7,Kherson,LiqPay,True


- Multi-column sort

In [48]:
df_sorted_multi = df.sort_values(['status', 'amount'], ascending=[True, False])
df_sorted_multi.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,is_high_value
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,False
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,False
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,False
5,1062900,2024-12-03 21:49:00,NaT,Frontend Development,Canceled,0.0,TERNOPIL,,False
6,1062911,2024-12-04 13:12:00,NaT,Frontend Development,Canceled,0.0,Vinnytsia,,False


### Additional Important Initial Analysis Steps
**Checking for Missing Values**
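
One common way to summarize missing values is to count them per column with `isna().sum()` and compute their share with `isna().mean()`. A minimal sketch on hypothetical data (the `sample` frame is made up; on the real data the same calls would be applied to `df`):

```python
import pandas as pd

# Hypothetical sample with gaps in text columns
sample = pd.DataFrame(
    {
        "status": ["Completed", "Canceled", None],
        "city": ["Lviv", None, None],
        "amount": [100.0, 0.0, 0.0],
    }
)

missing_counts = sample.isna().sum()           # number of missing cells per column
missing_share = sample.isna().mean().round(2)  # fraction of rows that are missing

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```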

**Examining Unique Values**
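
`nunique()` gives the number of distinct values per column and `unique()` lists them for a single column. A short sketch on hypothetical data (`sample` is illustrative; the method names on the right are taken from the outputs above):

```python
import pandas as pd

# Hypothetical sample with a repeated status and a missing payment method
sample = pd.DataFrame(
    {
        "status": ["Completed", "Canceled", "Completed"],
        "payment_method": ["Google Pay", None, "PayPal"],
    }
)

distinct_per_column = sample.nunique()     # missing values are not counted by default
status_values = sample["status"].unique()  # distinct values in order of first appearance

print(distinct_per_column.to_dict())
# {'status': 2, 'payment_method': 2}
print(list(status_values))
# ['Completed', 'Canceled']
```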

**Data Quality Checks**
- Check for inconsistent city names (case sensitivity, whitespace)

- Value Counts (useful for categorical data)
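
Both checks can be combined: normalize the city strings, then count the values. This is a minimal sketch assuming that stripping whitespace and title-casing is an acceptable normalization; the `cities` Series is hypothetical, with spellings similar to those seen in the outputs above.

```python
import pandas as pd

# Hypothetical city values with the kinds of inconsistencies seen in the data
cities = pd.Series([" Lviv", "Lviv", "TERNOPIL", "zAPORIZHZHIA", "rivne", None])

# Strip surrounding whitespace and apply title case so variants collapse together
clean = cities.str.strip().str.title()

# dropna=False keeps missing cities visible in the counts
counts = clean.value_counts(dropna=False)
print(counts)
```

Comparing `cities.nunique()` with `clean.nunique()` is a quick way to see how many variants were merged.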

### Initial Observations:

1. **Data Structure Preview**:
   - The dataset contains transaction records with the columns `Number`, `Creation Date`, `Date when pay`, `Title`, `Status`, `Money amount`, `City`, and `Payment System`.
   - Mixed data types are visible at a glance (dates, numbers, text).

2. **Missing Values**:
   - `Date when pay` is empty for canceled transactions (e.g., index 2-6).
   - `Money amount` shows `0` for canceled transactions.
   - Some `City` and `Payment System` fields are empty (e.g., `City` at index 10).

3. **Status Patterns**:
   - `Completed` status correlates with:
     - Filled payment dates
     - Positive money amounts
     - Specified payment methods
   - `Canceled` status shows:
     - Empty payment dates
     - Zero amounts
     - Often missing payment methods

4. **Data Quality Indicators**:
   - Inconsistent city name formatting:
     - Mixed case (e.g., "TERNOPIL" vs "Vinnytsia")
     - Leading or trailing spaces (e.g., " Lviv" vs "Lviv")
     - Combined city names ("OdesaDnipro")
   - Payment method names use different languages (Ukrainian and English)

5. **Temporal Patterns**:
   - Transactions span December 2024 (based on creation dates)
   - Payment delays vary (e.g., index 0 shows a 2-minute delay, while index 12 shows a 17-day delay)

6. **Business Context Clues**:
   - Course titles suggest an educational platform ("AI Engineering", "Frontend Development", etc.)
   - Payment amounts vary significantly (from 1.0 to 42750.0 among completed transactions)
   - Multiple payment systems are supported (Google Pay, PayPal, bank transfers)

7. **Potential Anomalies**:
   - Very small payments (1.0) in rows 28-34 might represent test transactions
   - Some payment dates precede their creation dates (e.g., "01.01.2020" at index 290, the second-to-last row, against a creation date of 31.12.2024)

**Key Questions for Further Investigation**:
- Why do some completed transactions have payment dates before creation dates?
- Should city names be standardized (case, spacing)?
- Are the 1.0 payments valid or system artifacts?
- What explains the extreme payment amount range?
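
The first question can be turned into a direct check by flagging rows whose payment date precedes the creation date. A minimal sketch using the timestamps and amounts of the rows at index 290 and 291; `tx` is an illustrative frame, and on the real data the same comparison would run on `df` after the date columns are converted to datetime.

```python
import pandas as pd

# Values copied from the rows at index 290 and 291 of the dataset
tx = pd.DataFrame(
    {
        "creation_date": pd.to_datetime(["2024-12-31 16:40", "2024-12-31 17:29"]),
        "payment_date": pd.to_datetime(["2020-01-01 14:29", "2024-12-31 17:32"]),
        "amount": [2935.44, 9898.56],
    },
    index=[290, 291],
)

# Rows where the payment timestamp is earlier than the creation timestamp
paid_before_created = tx[tx["payment_date"] < tx["creation_date"]]

print(paid_before_created.index.tolist())
# [290]
```

A similar mask, such as `df['amount'] == 1.0`, would isolate the suspiciously small payments for manual review.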

This initial inspection reveals both the dataset's structure and immediate data quality considerations that would need addressing before deeper analysis.