# 6.1 Working with Files & Automation

Automation allows data analysts to **process data efficiently**, reduce repetitive manual work, and focus more on **strategic analysis and interpretation** rather than routine tasks.

---

## Why Automation in Data Analysis?

- Saves time on repetitive tasks
- Reduces human errors
- Ensures consistency and reproducibility
- Enables scalable data processing (large files, multiple datasets)

---

## Key Concepts

### 1) Automation in Data Analysis
Automation in data analysis typically covers tasks such as:
- Reading multiple files
- Cleaning datasets
- Generating reports
- Scheduling scripts

---

### 2) File Management
File management refers to handling files and directories programmatically.

#### Steps in Automation
1. Identify repetitive tasks  
2. Define the process clearly  
3. Choose the right tools & libraries  
4. Write, test, and optimize the code  

# 6.2 File Handling in Python

Python provides **built-in support** for file handling using the `open()` function.

A file is accessed through a **file handle (file object)**, which allows reading or writing data.

---

## Basic File Operations

1. Open a file  
2. Read or write data  
3. Close the file

## 6.2.1 Opening a File
The general form is `file = open("data.txt", mode)`, where `mode` is one of the mode strings listed below.

---

## Common File Modes

| Mode | Description |
|---|---|
| `r` | Read mode (default) |
| `w` | Write mode (creates file if it doesn't exist, overwrites if it does) |
| `x` | Exclusive creation (fails if file exists) |
| `a` | Append mode (adds data at the end of file) |
| `b` | Binary mode (combined with another mode, e.g. `rb` or `wb`) |
| `+` | Update mode: read and write (combined with another mode, e.g. `r+`) |
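
To make the table concrete, here is a minimal sketch of how a few of these modes behave. It works inside a temporary directory so nothing on disk is overwritten; the file name `demo.txt` is an assumption made only for illustration.

```python
import os
import tempfile

# Work inside a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # "x" creates the file and fails if it already exists.
    with open(path, "x") as f:
        f.write("first line\n")

    try:
        open(path, "x")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # "r+" opens an existing file for both reading and writing.
    with open(path, "r+") as f:
        text_before = f.read()      # the position is now at the end of the file
        f.write("second line\n")    # so this write lands after the first line

    # "rb" combines read mode with binary mode and returns bytes, not str.
    with open(path, "rb") as f:
        raw = f.read()

print(exclusive_failed)
print(text_before)
print(raw)
```

The second `open(path, "x")` is expected to raise `FileExistsError`, which is what makes `x` useful when a script must never overwrite an existing result file.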

In [6]:
file = open("data.txt")

In [7]:
file.close()

#### Using the `with` Statement (Best Practice)
 - The `with` statement automatically closes the file, even if an error occurs.
 - ✔ No need to call `file.close()`
 - ✔ Safer and cleaner approach
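
The claim that `with` closes the file even when an error occurs can be checked directly. The following is a small sketch, not part of the original notebook: it uses a temporary file, and the `ValueError` is raised on purpose to simulate a failure.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")
    with open(path, "w") as f:
        f.write("sample text")

    try:
        with open(path, "r") as file:
            content = file.read()
            raise ValueError("simulated failure inside the with block")
    except ValueError:
        pass

    # The file object still exists, but the with statement has closed it.
    closed_after_error = file.closed

print(content)
print(closed_after_error)
```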

In [10]:
with open("data.txt", "r") as file:
    content = file.read()  # call read() with parentheses; bare file.read is only the method object
content

## 6.2.2 Reading Files

### Read Entire File

In [11]:
file = open("data.txt", "r")
content = file.read()
file.close()

### Read Line by Line

In [13]:
file = open("data.txt", "r")
for line in file:
    print(line)
file.close()

Output:

Module 6: Working with Files & Automation

Automation allows data analysts to process data efficiently, reduce repetitive manual work, and focus more on strategic analysis and interpretation rather than routine tasks.
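
Each line read from a file keeps its trailing newline, so `print(line)` adds a second one and the printed text comes out double-spaced. Below is a minimal sketch of one way to avoid that, using a temporary file whose contents are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")
    with open(path, "w") as f:
        f.write("line one\nline two\nline three\n")

    cleaned_lines = []
    with open(path, "r") as file:
        for line in file:
            # rstrip("\n") drops only the trailing newline character.
            cleaned_lines.append(line.rstrip("\n"))

for line in cleaned_lines:
    print(line)
```

An alternative with the same visible effect is `print(line, end="")`, which leaves the line untouched and stops `print` from adding its own newline.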


## 6.2.3 Writing to a File

`w` mode overwrites existing content.

In [18]:
file = open("data.txt", "w")
file.write("Automation allows data analysts to process data efficiently, reduce repetitive manual work, and focus more on strategic analysis and interpretation")
file.close()

## 6.2.4 Appending to a File

In [19]:
file = open("data.txt", "a")
file.write("\nrather than routine tasks")
file.close()
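
The difference between writing and appending is easiest to see by reading the file back afterwards. This sketch assumes a temporary file named `notes.txt` (a placeholder) rather than the real `data.txt`, so it can be run safely.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    with open(path, "w") as f:
        f.write("first draft")

    # Opening with "w" again discards the previous contents.
    with open(path, "w") as f:
        f.write("second draft")

    # "a" keeps what is already there and writes at the end.
    with open(path, "a") as f:
        f.write("\nappended line")

    with open(path, "r") as f:
        final_text = f.read()

print(final_text)
```

Note that `write()` does not add a newline on its own, which is why the appended text starts with `\n`.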

# 6.3 Working with OS Module
The `os` module in Python is used to **interact with the underlying operating system**.  
It provides a **portable way** to perform file and directory management tasks, which are foundational in **data analysis pipelines**.

For Data Analysts, the OS module is especially useful for:
- Managing data storage
- Automating file handling
- Organizing datasets
- Building robust data pipelines

---

## Key Uses of OS Module in Data Analysis

1. Navigating directories  
2. Listing and finding files  
3. Managing files and folders  
4. Organizing datasets  
5. Checking file status  
6. Accessing file metadata

---

## Important OS Module Functions

### 1) Get Current Working Directory
 - Returns the path of the current working directory.
    - `import os`
    - `os.getcwd()`

 ### 2) Change Working Directory
 - Changes the current working directory to the given path.
 - `os.chdir("C:/Users/Data")`
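
Because `os.chdir()` affects every relative path used afterwards, a script may want to remember its starting directory and switch back when it is done. A minimal sketch of that pattern, assuming a temporary directory as the target:

```python
import os
import tempfile

original_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as tmp_dir:
    try:
        os.chdir(tmp_dir)
        # realpath resolves symlinks so the two paths can be compared reliably.
        inside = os.path.realpath(os.getcwd()) == os.path.realpath(tmp_dir)
    finally:
        # Always switch back, even if something above fails.
        os.chdir(original_dir)

restored = os.getcwd() == original_dir
print(inside, restored)
```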

### 3) List Files and Directories
 - Returns a list of all files and directories in the specified path.
 - `os.listdir()`          # Current directory
 - `os.listdir("Data")`    # Specific directory
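
`os.listdir()` returns bare names (not full paths) in arbitrary order, and it mixes files with subdirectories. The sketch below shows one common pattern: sort the names, then keep only regular files with a given extension. The folder layout is invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Build a small sample folder: two CSV files, one text file, one subfolder.
    for name in ("sales.csv", "costs.csv", "readme.txt"):
        with open(os.path.join(data_dir, name), "w") as f:
            f.write("placeholder\n")
    os.mkdir(os.path.join(data_dir, "archive.csv"))  # a directory, not a file

    csv_files = []
    for name in sorted(os.listdir(data_dir)):
        full_path = os.path.join(data_dir, name)
        # listdir returns names only, so join with the folder before testing.
        if os.path.isfile(full_path) and name.endswith(".csv"):
            csv_files.append(name)

print(csv_files)
```

The `os.path.isfile()` check is what keeps the directory named `archive.csv` out of the result.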

### 4) Create a Directory
 - Creates a new directory.
 - `os.mkdir("New_Folder")`

### 5) Create Nested Directories
 - Creates parent and child directories recursively.
 - `os.makedirs("Parent/Child")`
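
Both `os.mkdir()` and `os.makedirs()` raise `FileExistsError` if the target already exists, which matters for scripts that run repeatedly. The sketch below shows the `exist_ok=True` argument of `os.makedirs()`; the folder names are placeholders and everything happens inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    nested = os.path.join(base_dir, "Parent", "Child")

    os.makedirs(nested)                  # creates Parent and Parent/Child
    os.makedirs(nested, exist_ok=True)   # second call is a no-op, no error

    try:
        os.mkdir(nested)                 # mkdir has no exist_ok option
        mkdir_failed = False
    except FileExistsError:
        mkdir_failed = True

    created = os.path.isdir(nested)

print(created, mkdir_failed)
```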

### 6) Remove a File
 - Deletes a specified file.
 - `os.remove("file.txt")`

### 7) Remove an Empty Directory
 - Deletes an empty directory.
 - `os.rmdir("Old_Folder")`
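
Both calls delete permanently, and `os.rmdir()` refuses to remove a directory that still contains files. A sketch of the usual order of operations, run against a temporary folder so that no real data is at risk:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    folder = os.path.join(base_dir, "Old_Folder")
    file_path = os.path.join(folder, "file.txt")
    os.mkdir(folder)
    with open(file_path, "w") as f:
        f.write("temporary contents")

    # rmdir only works on empty directories, so this attempt fails.
    try:
        os.rmdir(folder)
        rmdir_failed_when_full = False
    except OSError:
        rmdir_failed_when_full = True

    os.remove(file_path)   # delete the file first
    os.rmdir(folder)       # now the directory is empty and can be removed

    folder_still_exists = os.path.exists(folder)

print(rmdir_failed_when_full, folder_still_exists)
```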

### 8) Path Handling (`os.path`)
The `os.path` submodule is used to work with file paths safely across platforms.

### 9) Check if Path Exists
 - `os.path.exists("file.txt")`
   
### 10) Check if File or Directory
 - `os.path.isfile("file.txt")`
 - `os.path.isdir("Folder")`

### 11) Join Paths (Platform Independent)
 - Avoids OS-specific path issues (Windows vs Linux).
 - `os.path.join("folder", "file.txt")`
 - ✔ Recommended over string concatenation
 - ✔ Ensures portability
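
As a short illustration, `os.path.join()` can be paired with a few other `os.path` helpers to take a path apart again. The folder and file names below are hypothetical.

```python
import os

# Build a path from parts; the separator is chosen for the current OS.
report_path = os.path.join("reports", "2024", "summary.csv")  # hypothetical names

# Take the path apart again.
folder = os.path.dirname(report_path)
file_name = os.path.basename(report_path)
stem, extension = os.path.splitext(file_name)

print(report_path)
print(folder, file_name, stem, extension)
```

The printed separator is `\` on Windows and `/` on Linux and macOS, which is exactly the difference that string concatenation would get wrong.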

---

## File & Directory Operations Summary

| Task | Function |
|---|---|
| Current directory | `os.getcwd()` |
| Change directory | `os.chdir()` |
| List files | `os.listdir()` |
| Create directory | `os.mkdir()` |
| Create nested directories | `os.makedirs()` |
| Remove file | `os.remove()` |
| Remove directory | `os.rmdir()` |
| Check existence | `os.path.exists()` |
| Join paths | `os.path.join()` |

---

### Why OS Module is Important for Data Analysts

- **Automates dataset organization**
- Handles large volumes of files
- Enables scalable ETL pipelines
- Prevents manual errors
- Works across operating systems

# 6.4 Automating Tasks: Writing Reusable Scripts
 - Automating tasks in Data Analytics involves using scripts to **streamline repetitive processes** such as data cleaning, transformation, analysis, and reporting.  
 - Automation saves time, reduces human error, and ensures **reproducible results**.
 - Python and SQL are among the most commonly used languages for automation in data analytics, thanks to their simplicity and, in Python's case, a rich ecosystem of libraries.

---

## Benefits of Automation

1. Time Saving  
2. Error Reduction  
3. Scalability  
4. Reproducibility  
5. More Focus on Insights, Not Manual Work  

---

## Principles of Writing Reusable Scripts

### 1) Modularity
Break scripts into **small, focused functions and modules**.
### 2) Parameterization
Avoid hard-coding values. Use parameters instead.
### 3) Error Handling
Handle unexpected errors using `try`/`except` blocks.
### 4) Documentation
Use comments and docstrings for clarity.
### 5) Version Control
Use `Git` to track changes and collaborate efficiently.
### 6) Testing
Test functions with small inputs before scaling.
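
Several of these principles can be seen together in one small function. The sketch below is only an illustration: the function name `count_rows`, its parameters, and the sample file are all assumptions, not part of any library.

```python
import csv
import os
import tempfile


def count_rows(csv_path, skip_header=True):
    """Return the number of data rows in a CSV file, or None if it is missing.

    csv_path: path to the file (a parameter, not a hard-coded value).
    skip_header: whether the first row is a header and should not be counted.
    """
    try:
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
    except FileNotFoundError:
        # Report the problem instead of crashing the whole pipeline.
        print(f"File not found: {csv_path}")
        return None
    if skip_header and rows:
        rows = rows[1:]
    return len(rows)


# Quick test on a small input before scaling up.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, "sample.csv")
    with open(sample, "w", newline="") as f:
        csv.writer(f).writerows([["id", "value"], [1, 10], [2, 20]])

    row_count = count_rows(sample)
    missing_count = count_rows(os.path.join(tmp_dir, "missing.csv"))

print(row_count, missing_count)
```

The function is modular (one job), parameterized (path and header flag), documented with a docstring, guarded with `try`/`except`, and exercised on a tiny input first.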

---

## Common Automation Tasks & Tools
### 1) Data Collection
 - APIs → requests
 - Web Scraping → BeautifulSoup, Selenium

### 2) Data Cleaning & Preparation
 - pandas
 - scikit-learn pipelines
 - SQL

### 3) Exploratory Data Analysis (EDA)
 - Automated summary statistics
 - Batch visualization
 - Outlier detection scripts

### 4) Model Training & Evaluation
 - Reusable training pipelines
 - Automated evaluation metrics

### 5) Reporting & Orchestration
 - Automated report generation
 - Scheduled pipelines (cron / task scheduler)

# 6.5 Automating Excel & CSV Cleaning
Automation can be used to repeatedly clean Excel or CSV files using Python libraries.

---

## Common Libraries
 - pandas
 - openpyxl

---

## Typical Steps
 - Read the data (after `import pandas as pd`)
    - `df = pd.read_csv("data.csv")`
    - `df = pd.read_excel("data.xlsx")`
 - Clean the data, for example by dropping rows with missing values
    - `df = df.dropna()`
 - Save the cleaned data
    - `df.to_csv("cleaned_data.csv", index=False)`
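
The steps above can be wrapped in a reusable function. This is a hedged sketch rather than a prescribed workflow: the function name `clean_csv` and the sample data are invented, the `drop_duplicates()` step is an added assumption about what cleaning might include, and `pandas` must be installed.

```python
import os
import tempfile

import pandas as pd


def clean_csv(input_path, output_path):
    """Read a CSV, drop duplicate rows and rows with missing values, and save it."""
    df = pd.read_csv(input_path)
    df = df.drop_duplicates()
    df = df.dropna()
    df.to_csv(output_path, index=False)
    return df


with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "data.csv")
    clean_path = os.path.join(tmp_dir, "cleaned_data.csv")

    # Invented sample data: one duplicate row and one row with a missing value.
    with open(raw_path, "w") as f:
        f.write("name,score\nAna,90\nAna,90\nBen,\nCara,75\n")

    cleaned = clean_csv(raw_path, clean_path)
    saved_rows = len(pd.read_csv(clean_path))

print(cleaned)
print(saved_rows)
```

Because the input and output paths are parameters, the same function can be called in a loop over many files.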

# 6.6 Loop-Based Data Processing
Loops allow batch processing of multiple files automatically.

### 1) Processing Multiple Files

In [None]:
import os

# Print the contents of every .txt file in the "data" folder.
for file in os.listdir("data"):
    if file.endswith(".txt"):
        with open(os.path.join("data", file)) as f:
            print(f.read())


### 2) Processing Multiple CSV Files

In [20]:
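import os
import pandas as pd

# Load each CSV file in the "data" folder and report its (rows, columns) shape.
for file in os.listdir("data"):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join("data", file))
        print(df.shape)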


### 3) Automated Calculations Example

In [21]:
total = 0
values = [10, 20, 30]

# Accumulate a running total of the values.
for value in values:
    total += value

print("Total:", total)  # prints "Total: 60"


### Automated Report Generation
 - Loop through datasets
 - Calculate metrics
 - Export results to CSV / Excel / PDF
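
The three steps can be sketched with the standard library alone. The datasets, metric names, and the `report.csv` file name below are all invented for illustration; a real pipeline would load the values from files and would likely use pandas for the export.

```python
import csv
import os
import statistics
import tempfile

# Invented datasets standing in for data loaded from files.
datasets = {
    "north": [120, 135, 150],
    "south": [80, 95, 110],
}

# Loop through the datasets and calculate the same metrics for each.
report_rows = []
for name, values in datasets.items():
    report_rows.append({
        "dataset": name,
        "count": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
    })

# Export the results to a CSV report, then read it back to confirm.
with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, "report.csv")  # hypothetical file name
    with open(report_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["dataset", "count", "total", "mean"])
        writer.writeheader()
        writer.writerows(report_rows)

    with open(report_path, newline="") as f:
        saved = list(csv.DictReader(f))

print(saved)
```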

---

### Advantages of File Automation
 - Saves Time
 - Improves Accuracy
 - Scales Easily
 - Enables Scheduled Tasks
 - Reduces Manual Dependency