In [1]:
!pip install --upgrade pandas
import pandas as pd
from google.colab import files
uploaded = files.upload()

### 1. Reading Data from CSV with Specific Features
When reading CSV files, we often need to handle special cases:
- Different delimiters (here we use ';')
- Potential encoding issues
- Handling of missing values
- Parsing dates correctly

In [13]:
df = pd.read_csv('session_03_data_practice.csv', delimiter=';', parse_dates=['Creation Date', 'Date when pay'])

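**Note: The `parse_dates` parameter attempts to convert the specified columns to datetime objects.**  
This is especially useful for time series analysis. If pandas cannot parse a column, it leaves it as `object` without raising an error, so confirm the result with `df.info()`. In this dataset both date columns still show up as `object` below, which is why they are converted explicitly with `pd.to_datetime` in section 5.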
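
A minimal sketch of a more defensive load, using a made-up two-row sample in place of the real file (the sample rows and the `sample` name are assumptions for illustration only). It strips stray whitespace from the header and converts the day-first dates with an explicit format, so that a parsing problem does not go unnoticed:

```python
import io
import pandas as pd

# Hypothetical sample that mimics the file layout: ';' delimiter,
# trailing spaces in some header names, day-first dates, one empty date.
raw = (
    "Number ;Creation Date;Date when pay;Money amount  \n"
    "1;01.12.2024 10:50;01.12.2024 10:52;29597.5\n"
    "2;13.12.2024 15:30;;0.0\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=';')

# Remove leading/trailing whitespace from column names
sample.columns = sample.columns.str.strip()

# Convert dates explicitly; values that do not match the format become NaT
for col in ['Creation Date', 'Date when pay']:
    sample[col] = pd.to_datetime(sample[col], format='%d.%m.%Y %H:%M', errors='coerce')

print(sample.dtypes)
```

Spelling out the format avoids any ambiguity between day-first and month-first dates such as `01.12.2024`.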

### 2. Initial Data Exploration
Understanding your data is the first critical step in any analysis

**View first 5 rows to get a quick look at the data structure**

In [15]:
df.head()

Unnamed: 0,Number,Creation Date,Date when pay,Title,Status,Money amount,City,Payment System
0,1062823,01.12.2024 10:50,01.12.2024 10:52,AI Engineering,Completed,29597.5,Bucha,Банківський переказ
1,1062855,01.12.2024 20:53,01.12.2024 21:27,AI Engineering,Completed,17450.3,Kharkiv,Google Pay
2,1062856,01.12.2024 21:43,,Python. Data analysis,Canceled,0.0,Vinnytsia,
3,1062880,03.12.2024 0:18,,Frontend Development,Canceled,0.0,Pervomaisk,
4,1062899,03.12.2024 21:43,,AI Engineering,Canceled,0.0,Smila,


**View last 5 rows to check if data is consistent throughout**

In [16]:
df.tail()

Unnamed: 0,Number,Creation Date,Date when pay,Title,Status,Money amount,City,Payment System
287,1064720,30.12.2024 9:42,30.12.2024 12:49,Java Developer,Completed,2935.44,Irpin,Apple Pay
288,1064724,30.12.2024 11:32,,Frontend Development,Canceled,0.0,rivne,
289,1064775,31.12.2024 2:17,31.12.2024 2:22,Frontend Development,Completed,7423.92,Kherson,Apple Pay
290,1064793,31.12.2024 16:40,01.01.2020 14:29,Java Developer,Completed,2935.44,Irpin,ПриватБанк
291,1064796,31.12.2024 17:29,31.12.2024 17:32,Python. Web-developer,Completed,9898.56,Sumy,Apple Pay


**Get basic information about the DataFrame:**
- Number of non-null entries per column
- Data types of each column
- Memory usage

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Number          292 non-null    int64  
 1   Creation Date   292 non-null    object 
 2   Date when pay   180 non-null    object 
 3   Title           292 non-null    object 
 4   Status          291 non-null    object 
 5   Money amount    292 non-null    float64
 6   City            266 non-null    object 
 7   Payment System  180 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 18.4+ KB


**Generate descriptive statistics for numeric columns:**
- Count, mean, std, min, quartiles, max

In [18]:
df.describe()

Unnamed: 0,Number,Money amount
count,292.0,292.0
mean,1063745.0,3397.615034
std,443.8688,5771.572829
min,1062823.0,0.0
25%,1063608.0,0.0
50%,1063698.0,2935.44
75%,1063807.0,2935.44
max,1064796.0,42750.0


**Get DataFrame dimensions (rows, columns)**

In [19]:
df.shape

(292, 8)

**View column names (important for referencing columns correctly)**

In [20]:
df.columns

Index(['Number ', 'Creation Date', 'Date when pay', 'Title', 'Status  ',
       'Money amount  ', 'City', 'Payment System'],
      dtype='object')

In [21]:
df.columns.tolist()

['Number ',
 'Creation Date',
 'Date when pay',
 'Title',
 'Status  ',
 'Money amount  ',
 'City',
 'Payment System']

**Initial observations we might make:**
- Mixed data types (numeric, text, dates)
- Some columns have missing values (like 'Date when pay')
- Potential data quality issues (trailing spaces in column names such as `'Number '` and `'Status  '`, stray spaces and inconsistent capitalization in city names)
- Numeric columns like 'Money amount' have wide ranges

### 3. Renaming Columns
**Column renaming is important for:**
- Consistency (standard naming conventions)
- Readability (clear, descriptive names)
- Ease of use (avoid spaces/special characters in names)

In [24]:
df = df.rename(columns={
    'Number ': 'transaction_id',
    'Creation Date': 'creation_date',
    'Date when pay': 'payment_date',
    'Title': 'course_title',
    'Status  ': 'status',
    'Money amount  ': 'amount',
    'City': 'city',
    'Payment System': 'payment_method'
})

**Verify the changes**

In [25]:
df.columns.tolist()

['transaction_id',
 'creation_date',
 'payment_date',
 'course_title',
 'status',
 'amount',
 'city',
 'payment_method']
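
As an alternative to an explicit mapping, column names can also be normalized programmatically. The sketch below is only an illustration on a hand-written `pd.Index` (the `cols` and `normalized` names are assumptions); it keeps the original wording of each column, so it would produce `money_amount` rather than the shorter `amount` chosen above:

```python
import pandas as pd

# A few of the raw column names, including the trailing spaces
cols = pd.Index(['Number ', 'Creation Date', 'Status  ', 'Money amount  '])

# Strip outer whitespace, lowercase, and replace inner whitespace with '_'
normalized = cols.str.strip().str.lower().str.replace(r'\s+', '_', regex=True)

print(normalized.tolist())
```

An explicit `rename` mapping remains the better choice when the new names should differ in meaning from the old ones.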

### 4. Working with Columns and Rows

**Selecting Columns**
- Single column (returns Series)

In [27]:
amounts = df['amount']

In [28]:
print(amounts)

0      29597.50
1      17450.30
2          0.00
3          0.00
4          0.00
         ...   
287     2935.44
288        0.00
289     7423.92
290     2935.44
291     9898.56
Name: amount, Length: 292, dtype: float64

- Attribute access (`df.amount`) returns the same Series, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method


In [30]:
print(df.amount)

0      29597.50
1      17450.30
2          0.00
3          0.00
4          0.00
         ...   
287     2935.44
288        0.00
289     7423.92
290     2935.44
291     9898.56
Name: amount, Length: 292, dtype: float64


**Multiple columns (returns DataFrame)**

In [31]:
financial_data = df[['amount', 'payment_method']]

In [32]:
print(financial_data)

       amount       payment_method
0    29597.50  Банківський переказ
1    17450.30           Google Pay
2        0.00                  NaN
3        0.00                  NaN
4        0.00                  NaN
..        ...                  ...
287   2935.44            Apple Pay
288      0.00                  NaN
289   7423.92            Apple Pay
290   2935.44           ПриватБанк
291   9898.56            Apple Pay

[292 rows x 2 columns]


**Selecting Rows**
- First n rows (`head`)

In [33]:
df.head(10)

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method
0,1062823,01.12.2024 10:50,01.12.2024 10:52,AI Engineering,Completed,29597.5,Bucha,Банківський переказ
1,1062855,01.12.2024 20:53,01.12.2024 21:27,AI Engineering,Completed,17450.3,Kharkiv,Google Pay
2,1062856,01.12.2024 21:43,,Python. Data analysis,Canceled,0.0,Vinnytsia,
3,1062880,03.12.2024 0:18,,Frontend Development,Canceled,0.0,Pervomaisk,
4,1062899,03.12.2024 21:43,,AI Engineering,Canceled,0.0,Smila,
5,1062900,03.12.2024 21:49,,Frontend Development,Canceled,0.0,TERNOPIL,
6,1062911,04.12.2024 13:12,,Frontend Development,Canceled,0.0,Vinnytsia,
7,1062914,04.12.2024 14:57,,Frontend Development,Canceled,0.0,Vinnytsia,
8,1062915,04.12.2024 16:08,,AI Engineering,Canceled,0.0,Kharkiv,
9,1062925,04.12.2024 21:51,,AI Engineering,Canceled,0.0,Kovel,


- By position (iloc)

In [34]:
df.iloc[5:11]

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method
5,1062900,03.12.2024 21:49,,Frontend Development,Canceled,0.0,TERNOPIL,
6,1062911,04.12.2024 13:12,,Frontend Development,Canceled,0.0,Vinnytsia,
7,1062914,04.12.2024 14:57,,Frontend Development,Canceled,0.0,Vinnytsia,
8,1062915,04.12.2024 16:08,,AI Engineering,Canceled,0.0,Kharkiv,
9,1062925,04.12.2024 21:51,,AI Engineering,Canceled,0.0,Kovel,
10,1062927,04.12.2024 22:02,,Python. Data analysis,Canceled,0.0,,
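
Note that `iloc[5:11]` returned rows 5 through 10: position-based slices exclude the end position. Label-based slicing with `loc` includes the end label. A small sketch on a toy frame (the `demo` name and its values are assumptions):

```python
import pandas as pd

# Toy frame with the default integer index 0..11
demo = pd.DataFrame({'amount': list(range(12))})

by_position = demo.iloc[5:11]  # positions 5..10, end is exclusive
by_label = demo.loc[5:10]      # labels 5..10, end is inclusive

# With a default RangeIndex both selections contain the same rows
print(by_position.equals(by_label))
```

The two only coincide here because the index labels equal the row positions; after sorting or filtering they can differ.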


**Filtering Data**
- Completed transactions only

In [39]:
completed = df[df['status'] == 'Completed']
print(completed)

     transaction_id     creation_date      payment_date  \
0           1062823  01.12.2024 10:50  01.12.2024 10:52   
1           1062855  01.12.2024 20:53  01.12.2024 21:27   
12          1062938  05.12.2024 12:07  22.12.2024 12:29   
13          1062940  05.12.2024 15:35  05.12.2024 15:40   
14          1062947  05.12.2024 21:39  07.12.2024 13:35   
..              ...               ...               ...   
286         1064712  29.12.2024 22:58  30.12.2024 10:56   
287         1064720   30.12.2024 9:42  30.12.2024 12:49   
289         1064775   31.12.2024 2:17   31.12.2024 2:22   
290         1064793  31.12.2024 16:40  01.01.2020 14:29   
291         1064796  31.12.2024 17:29  31.12.2024 17:32   

               course_title     status    amount     city       payment_method  
0            AI Engineering  Completed  29597.50    Bucha  Банківський переказ  
1            AI Engineering  Completed  17450.30  Kharkiv           Google Pay  
12     Frontend Development  Completed   8910.00

- High-value transactions (> 20,000)

In [40]:
high_value = df[df['amount'] > 20000]
print(high_value)

     transaction_id     creation_date      payment_date  \
0           1062823  01.12.2024 10:50  01.12.2024 10:52   
16          1062949  05.12.2024 21:57   06.12.2024 9:15   
21          1063014  08.12.2024 21:58  08.12.2024 23:51   
37          1063251  14.12.2024 14:34  18.12.2024 13:11   
46          1063333  16.12.2024 11:11  17.12.2024 13:42   
50          1063391   17.12.2024 9:59  17.12.2024 10:01   
234         1064053   23.12.2024 0:38   23.12.2024 0:42   
270         1064553  27.12.2024 18:06  27.12.2024 18:09   

              course_title     status   amount          city  \
0           AI Engineering  Completed  29597.5         Bucha   
16          AI Engineering  Completed  29597.5         Irpin   
21          AI Engineering  Completed  29597.5      Cherkasy   
37          AI Engineering  Completed  42750.0       Kharkiv   
46   Python. Data analysis  Completed  29695.7          Lviv   
50   Python. Data analysis  Completed  29695.7       Kherson   
234  Python. Data an

- Multiple conditions (use & for AND, | for OR)

In [41]:
completed_high_value = df[(df['status'] == 'Completed') & (df['amount'] > 20000)]
print(completed_high_value)

     transaction_id     creation_date      payment_date  \
0           1062823  01.12.2024 10:50  01.12.2024 10:52   
16          1062949  05.12.2024 21:57   06.12.2024 9:15   
21          1063014  08.12.2024 21:58  08.12.2024 23:51   
37          1063251  14.12.2024 14:34  18.12.2024 13:11   
46          1063333  16.12.2024 11:11  17.12.2024 13:42   
50          1063391   17.12.2024 9:59  17.12.2024 10:01   
234         1064053   23.12.2024 0:38   23.12.2024 0:42   
270         1064553  27.12.2024 18:06  27.12.2024 18:09   

              course_title     status   amount          city  \
0           AI Engineering  Completed  29597.5         Bucha   
16          AI Engineering  Completed  29597.5         Irpin   
21          AI Engineering  Completed  29597.5      Cherkasy   
37          AI Engineering  Completed  42750.0       Kharkiv   
46   Python. Data analysis  Completed  29695.7          Lviv   
50   Python. Data analysis  Completed  29695.7       Kherson   
234  Python. Data an
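
The parentheses around each condition are required because `&`, `|` and `~` bind more tightly than comparison operators in Python. A short sketch on a toy frame (all values below are illustrative assumptions):

```python
import pandas as pd

demo = pd.DataFrame({
    'status': ['Completed', 'Canceled', 'Completed', None],
    'amount': [29597.5, 0.0, 2935.44, 0.0],
})

is_completed = demo['status'] == 'Completed'  # a missing status compares as False
is_high = demo['amount'] > 20000

both = demo[is_completed & is_high]     # AND
either = demo[is_completed | is_high]   # OR
not_completed = demo[~is_completed]     # NOT (includes the row with missing status)

print(both, either, not_completed, sep='\n\n')
```

Storing each condition in a named variable keeps longer filters readable and avoids forgetting the parentheses.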

- Filter by string contains (case sensitive)

In [42]:
ai_courses = df[df['course_title'].str.contains('AI Engineering', na=False)]
print(ai_courses)

     transaction_id     creation_date      payment_date    course_title  \
0           1062823  01.12.2024 10:50  01.12.2024 10:52  AI Engineering   
1           1062855  01.12.2024 20:53  01.12.2024 21:27  AI Engineering   
4           1062899  03.12.2024 21:43               NaN  AI Engineering   
8           1062915  04.12.2024 16:08               NaN  AI Engineering   
9           1062925  04.12.2024 21:51               NaN  AI Engineering   
11          1062931  04.12.2024 23:11               NaN  AI Engineering   
15          1062948  05.12.2024 21:56               NaN  AI Engineering   
16          1062949  05.12.2024 21:57   06.12.2024 9:15  AI Engineering   
21          1063014  08.12.2024 21:58  08.12.2024 23:51  AI Engineering   
34          1063174  13.12.2024 13:30               NaN  AI Engineering   
35          1063186  13.12.2024 15:30               NaN  AI Engineering   
36          1063225   14.12.2024 0:26               NaN  AI Engineering   
37          1063251  14.1
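
By default `str.contains` treats the pattern as a regular expression, which matters for titles such as 'Python. Data analysis' where `.` matches any character. The sketch below contrasts regex, literal and case-insensitive matching; the 'Python2 Data analysis' title is a made-up value used only to show the difference:

```python
import pandas as pd

titles = pd.Series([
    'AI Engineering',
    'Python. Data analysis',
    'Python2 Data analysis',  # hypothetical title, not in the dataset
    None,
])

# Regex (default): '.' matches any single character
regex_match = titles.str.contains('Python. Data', na=False)

# Literal match: '.' must be an actual dot
literal_match = titles.str.contains('Python. Data', regex=False, na=False)

# Case-insensitive match
ci_match = titles.str.contains('ai engineering', case=False, na=False)

print(regex_match.tolist(), literal_match.tolist(), ci_match.tolist())
```

`na=False` makes missing titles count as non-matches, so the result can be used directly as a boolean mask.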

### 5. Basic Data Operations

**Adding Columns**
- Calculate payment delay in days (for completed transactions)

In [50]:
# Convert string dates to datetime first
df['creation_date'] = pd.to_datetime(df['creation_date'], dayfirst=True)
df['payment_date'] = pd.to_datetime(df['payment_date'], dayfirst=True, errors='coerce')

# Now calculate payment delay (in days)
df['payment_delay'] = (df['payment_date'] - df['creation_date']).dt.days

df.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,0.0
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,0.0
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,
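
`.dt.days` keeps only the whole-day part of each difference, so a payment made minutes after creation shows a delay of 0. When finer resolution is needed, `.dt.total_seconds()` is one option. A sketch using two creation/payment pairs taken from the outputs above (the `demo`, `delay_days` and `delay_hours` names are assumptions):

```python
import pandas as pd

demo = pd.DataFrame({
    'creation_date': pd.to_datetime(['2024-12-01 10:50', '2024-12-14 14:34']),
    'payment_date': pd.to_datetime(['2024-12-01 10:52', '2024-12-18 13:11']),
})

delta = demo['payment_date'] - demo['creation_date']

demo['delay_days'] = delta.dt.days                      # whole days only
demo['delay_hours'] = delta.dt.total_seconds() / 3600   # fractional hours

print(demo)
```
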


**Create a boolean column indicating high-value transactions**

In [51]:
df['is_high_value'] = df['amount'] > 20000
df.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay,is_high_value
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,0.0,True
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,0.0,False
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,,False
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,,False
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,,False


**Removing Columns**
- Drop the temporary column we created

In [52]:
df = df.drop(columns=['is_high_value'])
df.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay
0,1062823,2024-12-01 10:50:00,2024-12-01 10:52:00,AI Engineering,Completed,29597.5,Bucha,Банківський переказ,0.0
1,1062855,2024-12-01 20:53:00,2024-12-01 21:27:00,AI Engineering,Completed,17450.3,Kharkiv,Google Pay,0.0
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,


**Sorting Data**
- Sort by amount (descending)

In [54]:
df_sorted = df.sort_values('amount', ascending=False)
df_sorted.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay
37,1063251,2024-12-14 14:34:00,2024-12-18 13:11:00,AI Engineering,Completed,42750.0,Kharkiv,WayForPay,3.0
46,1063333,2024-12-16 11:11:00,2024-12-17 13:42:00,Python. Data analysis,Completed,29695.7,Lviv,Google Pay,1.0
270,1064553,2024-12-27 18:06:00,2024-12-27 18:09:00,Python. Data analysis,Completed,29695.7,zAPORIZHZHIA,Google Pay,0.0
234,1064053,2024-12-23 00:38:00,2024-12-23 00:42:00,Python. Data analysis,Completed,29695.7,ZAPORIZHZHIA,Монобанк,0.0
50,1063391,2024-12-17 09:59:00,2024-12-17 10:01:00,Python. Data analysis,Completed,29695.7,Kherson,LiqPay,0.0


- Multi-column sort

In [55]:
df_sorted_multi = df.sort_values(['status', 'amount'], ascending=[True, False])
df_sorted_multi.head()

Unnamed: 0,transaction_id,creation_date,payment_date,course_title,status,amount,city,payment_method,payment_delay
2,1062856,2024-12-01 21:43:00,NaT,Python. Data analysis,Canceled,0.0,Vinnytsia,,
3,1062880,2024-12-03 00:18:00,NaT,Frontend Development,Canceled,0.0,Pervomaisk,,
4,1062899,2024-12-03 21:43:00,NaT,AI Engineering,Canceled,0.0,Smila,,
5,1062900,2024-12-03 21:49:00,NaT,Frontend Development,Canceled,0.0,TERNOPIL,,
6,1062911,2024-12-04 13:12:00,NaT,Frontend Development,Canceled,0.0,Vinnytsia,,
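
Rows with a missing sort key are placed last by default. When the goal is to inspect them, `na_position='first'` brings them to the top. A sketch on a toy frame (the values are illustrative assumptions):

```python
import pandas as pd

demo = pd.DataFrame({
    'status': ['Completed', None, 'Canceled', 'Completed'],
    'amount': [2935.44, 0.0, 0.0, 42750.0],
})

# status ascending, amount descending, missing statuses shown first
ordered = demo.sort_values(
    ['status', 'amount'],
    ascending=[True, False],
    na_position='first',
)

print(ordered)
```
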


### 6. Additional Important Initial Analysis Steps
**Checking for Missing Values**

In [57]:
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
transaction_id      0
creation_date       0
payment_date      112
course_title        0
status              1
amount              0
city               26
payment_method    112
payment_delay     112
dtype: int64
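
The identical counts for `payment_date` and `payment_method` (112 each) suggest that the same rows are affected. The sketch below shows one way to report missing values as shares and to check whether two columns are missing together, on a toy frame whose values are assumptions:

```python
import pandas as pd

demo = pd.DataFrame({
    'payment_date': pd.to_datetime(['2024-12-01 10:52', None, None]),
    'payment_method': ['Google Pay', None, None],
    'city': ['Bucha', 'Vinnytsia', None],
})

# Count and share of missing values per column
summary = pd.DataFrame({
    'missing': demo.isna().sum(),
    'share': demo.isna().mean(),
})

# True if both columns are missing in exactly the same rows
same_rows = bool((demo['payment_date'].isna() == demo['payment_method'].isna()).all())

print(summary)
print(same_rows)
```
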


**Examining Unique Values**

In [58]:
print("\nUnique status values:", df['status'].unique())
print("Unique course titles:", df['course_title'].unique())


Unique status values: ['Completed' 'Canceled' 'Pending payment' nan 'Partially paid']
Unique course titles: ['AI Engineering' 'Python. Data analysis' 'Frontend Development'
 'Machine Learning' 'Python. Web-developer ' 'Java Developer']
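
The output above shows `'Python. Web-developer '` with a trailing space, which would be treated as a different category from the same title without the space. A quick way to detect such values, sketched on a hand-written Series (the `titles` sample is an assumption):

```python
import pandas as pd

titles = pd.Series(['Python. Web-developer ', 'Java Developer', 'Python. Web-developer'])

# Flag values that change when outer whitespace is removed
has_padding = titles != titles.str.strip()

cleaned = titles.str.strip()

print(titles[has_padding].tolist())
print(titles.nunique(), cleaned.nunique())
```
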


**Data Quality Checks**
- Check for inconsistent city names (case sensitivity, whitespace)

In [59]:
print("\nSample city names:", df['city'].str.strip().unique()[:10])


Sample city names: ['Bucha' 'Kharkiv' 'Vinnytsia' 'Pervomaisk' 'Smila' 'TERNOPIL' 'Kovel' nan
 'Lviv' 'Sumy']


- Value Counts (useful for categorical data)

In [60]:
print("\nPayment method distribution:")
print(df['payment_method'].value_counts())


Payment method distribution:
payment_method
PayPal                 29
Монобанк               21
Apple Pay              21
Google Pay             20
Банківський переказ    19
WayForPay              19
LiqPay                 17
ПриватБанк             13
Готівка                11
Portmone               10
Name: count, dtype: int64
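
`value_counts` skips missing values by default, which is why the counts above sum to 180 rather than 292. Two commonly used options are `dropna=False` to include them and `normalize=True` to get shares instead of counts. A sketch on a toy Series (values are assumptions):

```python
import pandas as pd

methods = pd.Series(['PayPal', 'PayPal', 'Apple Pay', None])

counts_with_nan = methods.value_counts(dropna=False)  # includes the missing entry
shares = methods.value_counts(normalize=True)         # fractions of non-missing values

print(counts_with_nan)
print(shares)
```
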


- Cross-tabulation (relationship between two categorical variables)

In [61]:
print("\nCourse vs Status:")
print(pd.crosstab(df['course_title'], df['status']))


Course vs Status:
status                  Canceled  Completed  Partially paid  Pending payment
course_title                                                                
AI Engineering                17          5               0                0
Frontend Development          21         31               0                0
Java Developer                55        125               0                1
Machine Learning               5          3               1                0
Python. Data analysis          5          7               0                0
Python. Web-developer          5          9               1                0
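
Raw counts are hard to compare when courses differ a lot in volume. `pd.crosstab` can add totals with `margins=True` or convert each row to shares with `normalize='index'`. A sketch on a toy frame (the rows are illustrative assumptions, not the real counts):

```python
import pandas as pd

demo = pd.DataFrame({
    'course_title': ['AI Engineering', 'AI Engineering',
                     'Java Developer', 'Java Developer', 'Java Developer'],
    'status': ['Canceled', 'Completed', 'Completed', 'Completed', 'Canceled'],
})

# Counts with row and column totals in an extra 'All' row/column
with_totals = pd.crosstab(demo['course_title'], demo['status'], margins=True)

# Share of each status within a course (each row sums to 1)
row_shares = pd.crosstab(demo['course_title'], demo['status'], normalize='index')

print(with_totals)
print(row_shares)
```
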


**Handling Text Data**
- Clean city names (remove extra spaces, standardize case)

In [62]:
df['city'] = df['city'].str.strip().str.title()
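
To see what this cleaning step does and does not fix, here is a sketch on a few city spellings that appear in this notebook (the `cities` Series itself is an assumption for illustration):

```python
import pandas as pd

cities = pd.Series(['TERNOPIL', 'rivne', ' Lviv', 'zAPORIZHZHIA', 'OdesaDnipro', None])

# Remove outer whitespace, then capitalize the first letter of each word
cleaned = cities.str.strip().str.title()

print(cleaned.tolist())
```

Case and whitespace problems are resolved and missing values stay missing, but a merged value such as 'OdesaDnipro' is only lowercased after the first letter. Entries like that need a separate manual correction.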

- Extract year/month from dates for temporal analysis

In [63]:
df['creation_year'] = df['creation_date'].dt.year
df['creation_month'] = df['creation_date'].dt.month
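
The `.dt` accessor exposes further components that are often useful for grouping. A sketch with two creation timestamps from this dataset (the `dates` name is an assumption):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2024-12-01 10:50', '2024-12-31 17:29']))

weekday = dates.dt.day_name()        # weekday name as text
year_month = dates.dt.to_period('M') # year and month as a single period value
hour = dates.dt.hour                 # hour of day

print(weekday.tolist(), year_month.astype(str).tolist(), hour.tolist())
```

A single year-month period is usually more convenient for grouping than separate year and month columns once the data spans more than one year.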

### Initial Observations:

1. **Data Structure Preview**:
   - The dataset contains transaction records with the columns `Number`, `Creation Date`, `Date when pay`, `Title`, `Status`, `Money amount`, `City`, and `Payment System` (several of the raw names carry trailing spaces).
   - Mixed data types are visible at a glance (dates, numbers, text).

2. **Missing Values**:
   - `Date when pay` is empty for canceled transactions (e.g., rows 2-4 in `head()`).
   - `Money amount` shows `0` for canceled transactions.
   - Some `City` and `Payment System` fields are empty (e.g., row 10 has no city, and canceled rows have no payment system).

3. **Status Patterns**:
   - `Completed` status correlates with:
     - Filled payment dates
     - Positive money amounts
     - Specified payment methods
   - `Canceled` status shows:
     - Empty payment dates
     - Zero amounts
     - Often missing payment methods

4. **Data Quality Indicators**:
   - Inconsistent city name formatting:
     - Mixed case (e.g., "TERNOPIL" vs "Vinnytsia")
     - Stray leading or trailing spaces (e.g., " Lviv" vs "Lviv")
     - Combined city names ("OdesaDnipro")
   - Payment method names use different languages (Ukrainian and English)

5. **Temporal Patterns**:
   - Transactions span December 2024 (based on creation dates)
   - Payment delays vary (e.g., row 0 shows a 2-minute delay, while row 12 shows a 17-day delay)

6. **Business Context Clues**:
   - Course titles suggest an educational platform ("AI Engineering", "Frontend Development", etc.)
   - Payment amounts vary significantly (from 0.0 for canceled orders up to 42750.0, with a median of 2935.44)
   - Multiple payment systems are supported (Google Pay, PayPal, bank transfers)

7. **Potential Anomalies**:
   - Very small payments (1.0) in rows 28-34 might represent test transactions
   - Some payment dates precede the creation date (e.g., row 290 was created on 31.12.2024 but has a payment date of "01.01.2020")

**Key Questions for Further Investigation**:
- Why do some completed transactions have payment dates before creation dates?
- Should city names be standardized (case, spacing)?
- Are the 1.0 payments valid or system artifacts?
- What explains the extreme payment amount range?
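
A possible starting point for the first and third questions is to flag the suspicious rows explicitly. The sketch below uses rows 290 and 291 from the `tail()` output plus one made-up row with an amount of 1.0; the `<= 1.0` threshold and the flag names are assumptions for illustration:

```python
import pandas as pd

demo = pd.DataFrame({
    'creation_date': pd.to_datetime(['2024-12-31 16:40', '2024-12-31 17:29', '2024-12-10 09:00']),
    'payment_date': pd.to_datetime(['2020-01-01 14:29', '2024-12-31 17:32', '2024-12-10 09:05']),
    'status': ['Completed', 'Completed', 'Completed'],
    'amount': [2935.44, 9898.56, 1.0],  # the last row is hypothetical
})

# Payment recorded before the order was created
demo['paid_before_created'] = demo['payment_date'] < demo['creation_date']

# Completed orders with a near-zero amount (threshold is an assumption)
demo['tiny_payment'] = (demo['status'] == 'Completed') & (demo['amount'] <= 1.0)

suspicious = demo[demo['paid_before_created'] | demo['tiny_payment']]
print(suspicious)
```

Flagging rather than dropping these rows keeps them available for review until their origin is understood.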

This initial inspection reveals both the dataset's structure and immediate data quality considerations that would need addressing before deeper analysis.