# Files

## Read a File

Read the content of the file `files/poem.txt` and print it.

In [2]:
with open('files/poem.txt', 'r') as file:
    content = file.read()
    print(content)

The road not taken
By Robert Frost

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;



## Write to a File

Write a note to the `files/note.txt` file that says, "Remember to buy milk and eggs."

In [6]:
with open('files/note.txt', 'w') as file:
    file.write("Remember to buy milk and eggs.")

## Append to a File

Append a new task to the previous `note.txt` file on a new line: "And Beer!"

In [7]:
with open('files/note.txt', 'a') as file:
    file.write("\nAnd Beer!")

In [8]:
!cat files/note.txt

Remember to buy milk and eggs.
And Beer!

## Counting Lines

Count the number of times "apple" appears in the `files/fruits.txt` file and print the count (it should be 5).

In [9]:
count = 0
with open('files/fruits.txt', 'r') as file:
    for line in file:
        if 'apple' in line.strip():
            count += 1
print(f"'apple' appears {count} times.")

'apple' appears 5 times.
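The loop above counts the lines that contain "apple", which equals the number of occurrences only when no line mentions the word more than once. A minimal sketch that counts every occurrence instead, assuming case-sensitive substring matching; `count_occurrences` is a hypothetical helper name and the sample lines are made up:

```python
def count_occurrences(lines, word):
    """Count every occurrence of `word` across an iterable of lines."""
    # str.count returns the number of non-overlapping matches in each line
    return sum(line.count(word) for line in lines)


# In-memory demo data (illustrative only, not the contents of fruits.txt)
sample = ["apple\n", "banana apple apple\n", "cherry\n"]
print(count_occurrences(sample, "apple"))
```

Because a file object iterates over its lines, the same helper can be called as `count_occurrences(file, 'apple')` inside a `with open(...)` block. Like the original check, it also matches longer words such as "pineapple".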


## Copy

Copy the content from `files/source.txt` to the new file `files/destination.txt`.

In [10]:
with open('files/source.txt', 'r') as source:
    content = source.read()
    with open('files/destination.txt', 'w') as destination:
        destination.write(content)
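When the content does not need to be inspected or changed, the standard library's `shutil.copyfile` can do the copy in one call. A small sketch that runs in a temporary directory so it does not touch the exercise files; the paths and the sample text are placeholders:

```python
import os
import shutil
import tempfile

# Work in a throwaway directory instead of files/
with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, 'source.txt')
    destination_path = os.path.join(tmp, 'destination.txt')

    with open(source_path, 'w') as source:
        source.write("Some content to copy.\n")

    # Copies the file contents; an existing destination file is overwritten
    shutil.copyfile(source_path, destination_path)

    with open(destination_path, 'r') as destination:
        copied = destination.read()

print(copied)
```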

## Working With CSV

Open the `files/grades.csv` file, calculate the average grade for each student, and print their names along with their average grades.

Solve this exercise in 3 different ways:
- Read the file in plain text (without using any CSV module).
- Use the `csv` module.
- Use the `pandas` library.

In [12]:
# Plain text

averages = {}

with open('files/grades.csv', 'r') as file:
    lines = file.readlines()[1:]  # Skip the header line
    
    for line in lines:
        parts = line.strip().split(',')
        name = parts[0]
        grades = [int(x) for x in parts[1:]]
        avg = sum(grades) / len(grades)
        averages[name] = avg

for name, avg in averages.items():
    print(f"{name}'s average grade is: {avg:.2f}")

Alice's average grade is: 87.67
Bob's average grade is: 85.33
Charlie's average grade is: 85.00
John's average grade is: 60.33
Amy's average grade is: 55.33
Hector's average grade is: 76.33
Mark's average grade is: 34.00


In [14]:
# csv
import csv

averages = {}

with open('files/grades.csv', 'r') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        name = row['Name']
        grades = [int(row['Math']), int(row['Science']), int(row['English'])]
        avg = sum(grades) / len(grades)
        averages[name] = avg

for name, avg in averages.items():
    print(f"{name}'s average grade is: {avg:.2f}")

Alice's average grade is: 87.67
Bob's average grade is: 85.33
Charlie's average grade is: 85.00
John's average grade is: 60.33
Amy's average grade is: 55.33
Hector's average grade is: 76.33
Mark's average grade is: 34.00


In [27]:
# pandas
import pandas as pd

df = pd.read_csv('files/grades.csv')

In [28]:
pd.concat([df['Name'], df.iloc[:, 1:].mean(axis=1).round(2)], axis=1, keys=['Name', 'Mean'])

      Name   Mean
0    Alice  87.67
1      Bob  85.33
2  Charlie  85.00
3     John  60.33
4      Amy  55.33
5   Hector  76.33
6     Mark  34.00


## Simple

Given a JSON file `files/simple.json`, add a new key-value pair "Language": "Python" and write the modified data back to the file.

In [30]:
import json

with open('files/simple.json', 'r') as f:
    data = json.load(f)

data['Language'] = 'Python'

with open('files/simple.json', 'w') as f:
    json.dump(data, f, indent=4)

In [31]:
!cat files/simple.json

{
    "Name": "John",
    "Age": 25,
    "City": "New York",
    "Language": "Python"
}

## Read and Display

Read data from the `files/info.yml` YAML file and display the content in the following order: `name`, `age`, and `hobbies`.

In [36]:
import yaml

with open('files/info.yml', 'r') as f:
    data = yaml.safe_load(f)

print(f"Name: {data['person']['name']}")
print(f"Age: {data['person']['age']}")
print("Hobbies:")
for hobby in data['person']['hobbies']:
    print(f"- {hobby}")

Name: Alice
Age: 30
Hobbies:
- reading
- hiking
- swimming


## Exception Handling

Try to open `files/missing.txt`. If the file does not exist, print "The file does not exist." and set the `content` variable to `None`.

In [None]:
try:
    with open('files/missing.txt', 'r') as file:
        content = file.read()
        print(content)
except FileNotFoundError:
    print("The file does not exist.")
    content = None

The file does not exist.


## Word Frequency

Calculate the frequency of each word in the `files/bigtext.txt` file and determine the top 10 most frequent words.

In [None]:
word_counts = {}

with open('files/bigtext.txt', 'r') as f:
    text = f.read().lower()
    # Removing punctuation and splitting by whitespace
    words = [word.strip('.,!?()[]{}":;') for word in text.split()]
    
    for word in words:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

# Sort the dictionary by value to get the top 10 words
sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
top_10 = sorted_words[:10]

top_10

[('a', 2),
 ('long', 2),
 ('feel', 1),
 ('free', 1),
 ('to', 1),
 ('use', 1),
 ('an', 1),
 ('excerpt', 1),
 ('from', 1),
 ('book', 1)]
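The manual dictionary counting and sorting above can also be expressed with `collections.Counter`. A minimal sketch, assuming the same punctuation-stripping rule as the solution; `top_words` is a hypothetical helper and the sample sentence is made up:

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in `text` as (word, count) pairs."""
    # Same normalisation as above: lowercase, split on whitespace, strip punctuation
    words = [word.strip('.,!?()[]{}":;') for word in text.lower().split()]
    # Skip tokens that consisted only of punctuation
    counts = Counter(word for word in words if word)
    return counts.most_common(n)


sample_text = "A long road. A long day!"
print(top_words(sample_text, 2))
```

For the exercise file, the text would come from `f.read()` as in the cell above.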

## Nested Parsing

List the IDs of users who have more than 5 incomplete tasks in the `files/todos.json` file.

In [47]:
with open('files/todos.json', 'r') as f:
    todos = json.load(f)

# Count incomplete tasks for each user
incomplete_counts = {}
for todo in todos:
    if not todo['completed']:
        if todo['userId'] in incomplete_counts:
            incomplete_counts[todo['userId']] += 1
        else:
            incomplete_counts[todo['userId']] = 1

for userId, count in incomplete_counts.items():
    if count > 5:
        print(f'User {userId} has {count} incomplete tasks')

User 1 has 9 incomplete tasks
User 2 has 12 incomplete tasks
User 3 has 13 incomplete tasks
User 4 has 14 incomplete tasks
User 5 has 8 incomplete tasks
User 6 has 14 incomplete tasks
User 7 has 11 incomplete tasks
User 8 has 9 incomplete tasks
User 9 has 12 incomplete tasks
User 10 has 8 incomplete tasks


## Log Analyzer

Given a log file `files/logs.txt` with entries of the form `[Timestamp] [LOG LEVEL] Message`, extract and count the number of occurrences of each log level (e.g., INFO, WARN, ERROR).

In [None]:
def analyze_log(file_path):
    log_levels = {"INFO": 0, "WARN": 0, "ERROR": 0}
    
    with open(file_path, 'r') as f:
        lines = f.readlines()
        for line in lines:
            for level in log_levels:
                if level in line:
                    log_levels[level] += 1
    return log_levels

In [None]:
# Usage
log_counts = analyze_log('files/logs.txt')
for level, count in log_counts.items():
    print(f"{level}: {count}")

INFO: 12
WARN: 4
ERROR: 6
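A substring test on the whole line also counts a level name that happens to appear inside the message, and it only finds the levels listed up front. A sketch that parses the bracketed level instead, assuming every entry follows the `[Timestamp] [LOG LEVEL] Message` form described above; `count_log_levels` and the sample lines are illustrative, not taken from `logs.txt`:

```python
import re
from collections import Counter

# "[timestamp] [LEVEL] message": capture the text inside the second pair of brackets
LOG_LINE = re.compile(r'^\[[^\]]*\]\s*\[([^\]]+)\]')


def count_log_levels(lines):
    """Count log levels by parsing the second bracketed field of each line."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # Lines that don't follow the expected format are skipped
            counts[match.group(1).strip()] += 1
    return counts


sample_lines = [
    "[2023-01-01 10:00:00] [INFO] Service started",
    "[2023-01-01 10:00:05] [ERROR] Could not write INFO record",
    "[2023-01-01 10:00:09] [WARN] Disk usage high",
]
level_counts = count_log_levels(sample_lines)
for level, count in level_counts.items():
    print(f"{level}: {count}")
```

In the second sample line, "INFO" appears in the message but only the `ERROR` level is counted.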


## Posts

In the following exercises use the files:
- `files/users.json`
- `files/posts.json`
- `files/comments.json`

In [89]:
# Read data
with open('files/users.json', 'r') as f:
    users = json.load(f)
with open('files/posts.json', 'r') as f:
    posts = json.load(f)
with open('files/comments.json', 'r') as f:
    comments = json.load(f)

### Compare Data

Find and print the post IDs that appear both as `id` in `files/posts.json` and as `postId` in `files/comments.json`.

In [90]:
post_ids = {post['id'] for post in posts}
comment_post_ids = {comment['postId'] for comment in comments}

# Common post IDs
post_ids.intersection(comment_post_ids)

{1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100}

### User Activity Logs

- Read `files/users.json` and write each user's name and email to a new `files/users/users_names.txt` file.
- Read `files/posts.json` and segregate them by users. Write each user's posts to a separate file named after them: `files/users/{user_id}_{user_name}.txt`.

In [91]:
import os

users_directory = 'files/users/'
# Create the directory if it doesn't exist yet
os.makedirs(users_directory, exist_ok=True)

In [92]:
# Write user names and emails to users_names.txt
file_path = os.path.join(users_directory, 'users_names.txt')
with open(file_path, 'w') as users_file:
    for user in users:
        users_file.write(f"Name: {user['name']}, Email: {user['email']}\n")

In [93]:
with open('files/posts.json', 'r') as f:
    posts = json.load(f)

# Segregate posts by users and write to respective files
for user in users:
    user_posts = [post for post in posts if post['userId'] == user['id']]

    file_name = f"{user['id']}_{user['name'].replace(' ', '_')}.txt"
    file_path = os.path.join(users_directory, file_name)
    with open(file_path, 'w') as post_file:
        for post in user_posts:
            post_file.write(f"Title: {post['title']}\n")
            post_file.write(f"Body: {post['body']}\n\n")

### Post Length Analysis

Analyze the posts and create a file, `post_length_analysis.txt`, that categorizes posts based on their length: "Short" (0-50 chars), "Medium" (51-200 chars), and "Long" (>200 chars). List the number of posts in each category.

In [94]:
short_posts = 0
medium_posts = 0
long_posts = 0

for post in posts:
    length = len(post['body'])
    if length <= 50:
        short_posts += 1
    elif 50 < length <= 200:
        medium_posts += 1
    else:
        long_posts += 1

with open('files/users/post_length_analysis.txt', 'w') as file:
    file.write(f"Short posts: {short_posts}\n")
    file.write(f"Medium posts: {medium_posts}\n")
    file.write(f"Long posts: {long_posts}\n")

### Email Domain Counter

Analyze the email addresses of all users and list the frequency of each email domain (e.g., @example.com) in a file called `email_domains.txt`.

In [95]:
from collections import defaultdict

email_domains = defaultdict(int)

for user in users:
    domain = user['email'].split('@')[1]
    email_domains[domain] += 1

with open('files/users/email_domains.txt', 'w') as file:
    for domain, count in email_domains.items():
        file.write(f"{domain}: {count}\n")

### Most Engaged Users

Identify the top 3 users who've made the most posts and received the highest number of comments on their posts. Generate a `most_engaged_users.json` that contains the user details, total posts, and total comments received.

In [97]:
user_engagement = []

for user in users:
    user_posts = [post for post in posts if post['userId'] == user['id']]
    post_ids = {post['id'] for post in user_posts}  # set for fast membership tests
    total_comments = sum(1 for comment in comments if comment['postId'] in post_ids)
    user_engagement.append({
        'User': user['name'],
        'Posts': len(user_posts),
        'Comments': total_comments
    })

pd.DataFrame(user_engagement)\
    .sort_values(by=['Comments', 'Posts'], ascending=False)\
    .head(3)\
    .to_json('files/users/most_engaged_users.json', orient='records')

### Geo-Post Analysis

Analyze where posts are coming from geographically. For each city where users reside, calculate the average number of posts made. Generate `city_post_avg.json` containing each city and its average post count.

In [98]:
city_post_count = {}
city_user_count = {}

for user in users:
    city = user['address']['city']
    user_posts = [post for post in posts if post['userId'] == user['id']]
    city_post_count[city] = city_post_count.get(city, 0) + len(user_posts)
    city_user_count[city] = city_user_count.get(city, 0) + 1

# Average posts per user in each city: divide by the users living in that city, not by all users
avg_post_count = {city: post_count / city_user_count[city] for city, post_count in city_post_count.items()}

with open('files/users/city_post_avg.json', 'w') as file:
    json.dump(avg_post_count, file)

### Post Interconnectivity

Check if users mention other users in their posts (based on usernames). Generate a `user_mentions.json` file that lists for each user which other users they've mentioned the most across all their posts.

In [99]:
mentions = {}

for user in users:
    mentions[user['username']] = {}
    user_posts = [post for post in posts if post['userId'] == user['id']]
    for post in user_posts:
        for other_user in users:
            if other_user['username'] in post['body'] and other_user['username'] != user['username']:
                mentions[user['username']].setdefault(other_user['username'], 0)
                mentions[user['username']][other_user['username']] += 1

most_mentions = {}
for user, mention_count in mentions.items():
    sorted_mentions = sorted(mention_count.items(), key=lambda x: x[1], reverse=True)
    if sorted_mentions:
        most_mentions[user] = sorted_mentions[0]

with open('files/users/user_mentions.json', 'w') as file:
    json.dump(most_mentions, file)

# Bonus

## File Comparison Tool

Create a tool that takes in the paths of two text files and outputs the lines that differ between them.

In [None]:
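from itertools import zip_longest

def compare_files(file1, file2):
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    differences = []

    # zip_longest keeps the extra lines when one file is longer than the other
    # (plain zip would silently stop at the end of the shorter file)
    for l1, l2 in zip_longest(lines1, lines2, fillvalue=''):
        if l1 != l2:
            differences.append((l1, l2))

    return differences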

In [None]:
# Usage (no differences expected):
differences = compare_files('files/users/1_Leanne_Graham.txt', 'files/users/1_Leanne_Graham.txt')
for diff in differences:
    print(f"File 1: {diff[0]}File 2: {diff[1]}")
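A line-by-line comparison reports every following line as different once a single line is inserted or removed. For that case the standard library's `difflib` can align the two files first. A minimal sketch, where `unified_diff_lines` is a hypothetical wrapper and the sample lists stand in for the result of `readlines()`:

```python
import difflib


def unified_diff_lines(lines1, lines2, name1='file1', name2='file2'):
    """Return the unified diff between two lists of lines (empty if they match)."""
    return list(difflib.unified_diff(lines1, lines2, fromfile=name1, tofile=name2))


# Stand-ins for f1.readlines() and f2.readlines()
old = ["milk\n", "eggs\n"]
new = ["milk\n", "eggs\n", "beer\n"]

diff = unified_diff_lines(old, new)
print(''.join(diff))
```

Lines prefixed with `+` exist only in the second input and lines prefixed with `-` only in the first.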

## File Metadata Extractor

Create a Python program that navigates through every file and folder in a given directory (recursively), extracts metadata from each file, and then saves this metadata to a CSV file. The metadata to capture for each file: File Name, File Path, File Size, Last Modified Date.

In [79]:
import os
import csv
import datetime

def extract_metadata(directory, csv_file):
    with open(csv_file, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["File Name", "File Path", "File Size (Bytes)", "Last Modified Date"])
        
        for foldername, subfolders, filenames in os.walk(directory):
            for filename in filenames:
                filepath = os.path.join(foldername, filename)
                filesize = os.path.getsize(filepath)
                timestamp = os.path.getmtime(filepath)
                # Convert timestamp to datetime object
                dt_object = datetime.datetime.fromtimestamp(timestamp)
                # Convert datetime object to string
                last_modified = dt_object.strftime('%Y-%m-%d %H:%M:%S')
                writer.writerow([filename, filepath, filesize, last_modified])

In [80]:
# Usage
extract_metadata('files/', 'files/metadata.csv')

## Stack Overflow Survey Analysis

Unzip the datasets contained in `files/stackoverflow/`.

These datasets contain responses to the annual Stack Overflow survey, including various aspects like the type of developer, education, job satisfaction, and more.

**Objective**:
Analyze the Stack Overflow Developer Survey data to gain insights into the development community and its trends. This will involve integrating the yearly datasets, extracting relevant insights, visualizing the results, and creating a comprehensive PDF report.

**Tasks**:


1. File Discovery & Data Exploration:
    - Download and unzip the dataset folder.
    - Programmatically discover and read the yearly survey files.
    - Familiarize yourself with the structure and columns in the datasets.

2. Data Integration & Cleaning:
    - Handle missing values and inconsistencies between yearly datasets.
    - Integrate data for the past 5 years into one consolidated dataset, taking care of column differences and inconsistencies.

3. Analysis:
    - Identify the top 5 most popular programming languages for the past 5 years.
    - Calculate the median salary for developers in the US, Europe, and Asia for each year.
    - Determine the percentage of remote workers over the years.

4. Visualizations:
    - Plot the trends of the top 5 programming languages over the past 5 years.
    - Create a bar chart comparing the median salaries in the US, Europe, and Asia for each year.
    - Generate a pie chart showing the distribution of developers based on their highest level of formal education.

5. Generate PDF Report:
    - Create a comprehensive report detailing the above findings.
    - Include the charts generated in the visualizations step.
    - Save the report as `StackOverflow_Survey_Analysis.pdf`.
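The file discovery step in task 1 could start from something like the sketch below. It assumes the datasets are `.zip` archives containing CSV files, which the exercise does not state, and the helper names `unzip_all` and `discover_csv_files` are hypothetical:

```python
import glob
import os
import zipfile


def unzip_all(directory):
    """Extract every .zip archive found directly in `directory`.

    Returns the names of the archive members that were extracted.
    """
    extracted = []
    for archive_path in sorted(glob.glob(os.path.join(directory, '*.zip'))):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(directory)
            extracted.extend(archive.namelist())
    return extracted


def discover_csv_files(directory):
    """Return the paths of all CSV files under `directory`, searched recursively."""
    pattern = os.path.join(directory, '**', '*.csv')
    return sorted(glob.glob(pattern, recursive=True))
```

A possible use would be `unzip_all('files/stackoverflow/')` followed by `discover_csv_files('files/stackoverflow/')`, after which each discovered path can be read with `pd.read_csv`.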