# Recitation 7

This week we will go over how to read and write CSV files. Happy early Halloween!

In [None]:
!curl -L  https://rutgers.box.com/shared/static/r4tvnv8hxa0ze3przujngdcz8ezcj61a --output witnesses.csv
import csv
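
As a warm-up, here is a minimal sketch of the read, modify, write pattern that the tasks below rely on. It assumes nothing about the real data: the column names and rows are made up, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import csv
import io

# Made-up sample data standing in for a CSV file on disk
raw = "name,city\nWalter,Albuquerque\nJesse,Albuquerque\n"

infile = io.StringIO(raw)
outfile = io.StringIO()

# DictReader turns each row into a dict keyed by the header row
reader = csv.DictReader(infile)
# Reuse the same header for the output; lineterminator keeps the output to plain "\n"
writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()

for row in reader:
    row["name"] = row["name"].upper()  # change one field before writing the row back
    writer.writerow(row)

result = outfile.getvalue()
print(result)
```

With real files, swap the two buffers for `open(..., newline='')` calls inside a `with` statement.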

## Breaking Bad (Rutgers' Version)

You just accepted an offer with the office of Saul Goodman, Attorney at Law. Congrats! Your first task is to redact `witnesses.csv`, a document listing the witnesses in a murder investigation.
- Replace all the digits in the addresses with `****` but leave any letters
- Replace each digit in the phone numbers with `*`
- Replace the email user name with `****`
- Output the redacted file with the name `redacted_witnesses.csv`

Your boss Saul Goodman would also like to know where most of the witnesses live... just for educational purposes.
- Find the most common state mentioned in the CSV file

Lastly, Saul has asked you to find all the witnesses who work for law firms. Return a list of all the witnesses who have any of the following keywords in the `company_name` column:
- Esq
- Law
- Attorney

<img src="https://media.tenor.com/yg7X7jx8kowAAAAd/my-reaction-to-that-information-saul-goodman.gif" alt="meme" width="200"/>

In [None]:
def redact_address(address):
    pass

def redact_phone(phone):
    pass

def redact_email(email):
    pass

### Task 1 Redaction
```python
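def redact_address(address):
    words = address.split()
    redacted_words = []
    for word in words:
        # Mask only the words that contain a digit, so words such as "St." are kept.
        # A word mixing letters and digits (e.g. "4B") is masked as a whole.
        if any(c.isdigit() for c in word):
            redacted_words.append('****')
        else:
            redacted_words.append(word)
    return ' '.join(redacted_words)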

def redact_phone(phone):
    redacted_phone = ''.join(['*' if c.isdigit() else c for c in phone])
    return redacted_phone

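def redact_email(email):
    # Keep everything after the last '@' and mask the user name before it
    _, sep, domain = email.rpartition('@')
    if not sep:
        return '****'  # no '@' found, so mask the whole value
    return '****@' + domain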

with open("witnesses.csv", 'r', newline='') as infile, open("redacted_witnesses.csv", 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        # Redact the address, phone, and email fields
        row['address'] = redact_address(row['address'])
        row['phone1'] = redact_phone(row['phone1'])
        row['phone2'] = redact_phone(row['phone2'])
        row['email'] = redact_email(row['email'])
        writer.writerow(row)
```
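
The redaction rules can also be read as "mask every run of digits and keep every other character". Under that reading, a regular expression does the job in one line. The sketch below is an alternative, not the official answer: the helper names `redact_digit_runs` and `redact_each_digit` and the sample strings are made up for illustration.

```python
import re

def redact_digit_runs(text, mask="****"):
    """Replace each run of consecutive digits with the mask, keeping all other characters."""
    return re.sub(r"\d+", mask, text)

def redact_each_digit(text, mask="*"):
    """Replace every single digit with the mask, keeping separators such as '-'."""
    return re.sub(r"\d", mask, text)

# Made-up sample values
example_address = "308 Negra Arroyo Lane Apt 4B"
example_phone = "505-555-0123"

print(redact_digit_runs(example_address))  # **** Negra Arroyo Lane Apt ****B
print(redact_each_digit(example_phone))    # ***-***-****
```

Note that this version keeps the letter in a mixed token such as `4B`, which a word-by-word approach would have to handle separately.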

In [None]:
## Find the most common state mentioned in the CSV file

### Task 2 Most Popular State Answer
```python
states = {}
with open("witnesses.csv", 'r') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        # Add one to this state's running count, starting from 0 the first time it appears
        states[row['state']] = states.get(row['state'], 0) + 1

most_common_state = max(states, key=lambda s: states[s])
print(f"Most common state mentioned: {most_common_state}")
```
*Question:* While a lambda function is certainly not necessary here, it helps cut down on the amount of code that we write. Discuss some other solutions that do not use a lambda function.
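
To seed the discussion, here are a few lambda-free ways to pick the key with the largest count. This is a sketch on made-up counts, not the real witness data. All three calls are from the standard library.

```python
from collections import Counter
from operator import itemgetter

# Made-up counts standing in for the dictionary built from the CSV file
states = {"NJ": 3, "NY": 5, "CA": 2}

# Option 1: pass the dict's own get method as the key function
by_get = max(states, key=states.get)

# Option 2: compare (state, count) pairs on their second element
by_itemgetter = max(states.items(), key=itemgetter(1))[0]

# Option 3: let Counter do the ranking; most_common(1) returns [(state, count)]
by_counter = Counter(states).most_common(1)[0][0]

print(by_get, by_itemgetter, by_counter)  # NY NY NY
```

`Counter` can also replace the counting loop itself: `Counter(row['state'] for row in reader)` builds the same tallies directly from the rows.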

In [None]:
# Return a list of all the witnesses who have any of the following key words in the company_name column: Esq, Law, Attorney

### Task 3 Lawyer Employees Answer

```python
law_firm_keywords = ['Esq', 'Law', 'Attorney']
law_firm_witnesses = []

with open("witnesses.csv", 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        company_name = row['company_name']
        for keyword in law_firm_keywords:
            if keyword in company_name:
                law_firm_witnesses.append(row)
                break  # stop at the first match so a witness is not added twice

print("Witnesses who work for law firms:")
for witness in law_firm_witnesses:
    print(f"{witness['first_name']} {witness['last_name']}, {witness['company_name']}")
```
*Question:* Can we use a list comprehension or the `map` function to cut down on the amount of code in the for loop? Discuss amongst yourselves.
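
One possible answer, sketched on made-up rows rather than the real file: a list comprehension combined with `any()` collapses the nested loop into a single expression. Since the goal is to select rows rather than transform them, `filter` is a closer built-in match than `map`. The sample names and companies below are invented for illustration.

```python
law_firm_keywords = ['Esq', 'Law', 'Attorney']

# Made-up rows standing in for what csv.DictReader would yield
rows = [
    {"first_name": "Kim", "last_name": "Wexler", "company_name": "Wexler Law Office"},
    {"first_name": "Gus", "last_name": "Fring", "company_name": "Los Pollos Hermanos"},
    {"first_name": "Howard", "last_name": "Hamlin", "company_name": "Hamlin Attorney at Law, Esq"},
]

# List comprehension: keep a row if any keyword appears in its company name
law_firm_witnesses = [
    row for row in rows
    if any(keyword in row["company_name"] for keyword in law_firm_keywords)
]

# Equivalent selection with filter()
def works_for_law_firm(row):
    return any(keyword in row["company_name"] for keyword in law_firm_keywords)

via_filter = list(filter(works_for_law_firm, rows))

for witness in law_firm_witnesses:
    print(f"{witness['first_name']} {witness['last_name']}, {witness['company_name']}")
```

Because `any()` yields a single True or False per row, a company name that matches several keywords still appears only once in the result.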

In [None]:
!curl -L  https://rutgers.box.com/shared/static/uwan8c054x7a2i9b5l3gt5laoymgfqg1 --output spotify.csv

## Spotify Pop Song Extravaganza!

Ever watch RuPaul's Drag Race and wonder who decides the *lip sync for your life* songs? Yeah, super niche, I know, but you've just been hired by the production crew as the new RuPaul's Drag Race Lip Sync for Your Life Song Senior Analyst.
- Find songs with a danceability score over 0.95 and print each song's name and link. Also write each matching row to a new CSV file with two added columns: a track link, built by adding 'https://open.spotify.com/track/' in front of the `track_id`, and an album link, built by adding 'https://open.spotify.com/album/' in front of the `track_album_id`.
- Load the CSV file that you just created into a pandas DataFrame, sort it by the `track_popularity` column, and print the head and tail so that we can view the most and least popular songs.
- Your boss RuPaul wants to know why Taylor Swift's discography is so popular across every demographic. Look at the songs by Taylor Swift and print the average speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms.

<img src="https://media.tenor.com/VGRDkgs5YAkAAAAM/iconic-rupaul.gif" alt="meme" width="200"/>

In [None]:
### Do some magic here

### Task 1 Find Danceable Songs Answer
```python
most_danceable_songs = []

with open("spotify.csv", 'r', newline='') as infile, open("spotify_links.csv", 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # Copy the header list so the reader's own fieldnames are left untouched
    fieldnames = list(reader.fieldnames) + ["track_link", "album_link"]
    new_rows = []

    for row in reader:
        # Convert danceability to a float for comparison
        danceability = float(row["danceability"])

        # Keep only songs with a danceability score over 0.95
        if danceability > 0.95:
            # Construct the Spotify song and album links
            spotify_song_link = f"https://open.spotify.com/track/{row['track_id']}"
            spotify_album_link = f"https://open.spotify.com/album/{row['track_album_id']}"

            # Add the two link columns to the row before saving it
            row.update({
                "track_link": spotify_song_link,
                "album_link": spotify_album_link
            })
            new_rows.append(row)
            most_danceable_songs.append((row['track_name'], spotify_song_link))

    # Write the header and all matching rows once the input has been read
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(new_rows)

print("Songs with a danceability score over 0.95:")
for name, link in most_danceable_songs:
    print(f"{name} {link}")
```

In [None]:
# Take the CSV file that you just created and load it into a pandas dataframe, then sort this dataframe based on the track popularity column and print 
# the head and tail so that we can view the least and most popular songs

### Task 2 Sort Songs by Popularity Answer
```python
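import pandas as pd
songs_df = pd.read_csv('./spotify_links.csv')

# Sort from most to least popular
sorted_songs_df = songs_df.sort_values(by='track_popularity', ascending=False)
print(sorted_songs_df.head())  # most popular songs
print(sorted_songs_df.tail())  # least popular songs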
```
*Question:* Why is it that the highest rank is 81? Should it not be 100? What did we modify?

In [None]:
# Your boss RuPaul is interested as to why Taylor Swift's discography has been so popular amongst every demographic of people, look at songs by 
# Taylor Swift and print the average speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms

### Task 3 Uncovering Taylor Swift Answer
```python
keys = ['speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']
count = 0
sums = {key: 0 for key in keys}

with open("spotify.csv", 'r', newline='') as infile:
    reader = csv.DictReader(infile)

    for row in reader:
        if row['track_artist'] == 'Taylor Swift':
            for k in keys:
                sums[k] += float(row[k])
            # Count each song once, not once per feature
            count += 1

print("Taylor Swift's stats: ")
for k in keys:
    print(f'Average {k}: {sums[k]/count}')
```
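
Since pandas is already in use for the sorting task, the same averages can be computed with a boolean mask and `mean()`. The sketch below runs on a tiny made-up DataFrame with only two of the feature columns. The values are invented, and with the real data you would load `spotify.csv` with `pd.read_csv` and use the full `keys` list.

```python
import pandas as pd

# Two of the feature columns, to keep the made-up example small
keys = ['speechiness', 'tempo']

# Invented rows standing in for spotify.csv
songs = pd.DataFrame({
    "track_artist": ["Taylor Swift", "Taylor Swift", "Someone Else"],
    "speechiness": [0.04, 0.06, 0.30],
    "tempo": [100.0, 120.0, 90.0],
})

# Boolean mask selects only the Taylor Swift rows
taylor = songs[songs["track_artist"] == "Taylor Swift"]

# mean() averages each selected column over those rows
averages = taylor[keys].mean()
print(averages)
```

Here each song contributes exactly once to every average, which is the same invariant the loop version has to maintain by hand with its counter.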