In [None]:
! mkdir -p ../data
! wget -P ../data https://storage.googleapis.com/usc-data/newsworthiness-project/final-matching-articles-and-meetings.csv
! wget -P ../data https://storage.googleapis.com/usc-data/newsworthiness-project/full-meeting-data-with-headers.jsonl.zip
! wget -P ../data https://storage.googleapis.com/usc-data/newsworthiness-project/full_newsworthiness_training_data.jsonl
! wget -P ../data https://storage.googleapis.com/usc-data/newsworthiness-project/sfchron-fetched-articles.jsonl.zip

In [1]:
# unzip these files and put them in the ../data directory
# ... 

# Reading in City Council Minutes

First, we read in the transcribed city council meeting minutes. These were all obtained by downloading videos from pages like this one: https://sanfrancisco.granicus.com/player/clip/43243?view_id=10&meta_id=992444&redirect=true&h=b9650faa96d53d07034d556d92f18771

In [2]:
import os 
import glob
import pandas as pd 
city_council_data_df = pd.read_json('../data/full_newsworthiness_training_data.jsonl',  lines=True)

Each row of these city council minutes has a `label` column that is `True` or `False`. It indicates whether or not the city council policy in that row has been written about in news articles.

In [3]:
city_council_data_df.head()

Unnamed: 0,clip_id,index,class_name,text,time,proposal_number,end_time,header,transcribed_text,label,date
0,16637,18,agenda0,130031 Pursuant to Charter Sections 2.103 and ...,633.004,130031,635.705,COMMUNICATION,"[{'text': ' Colleagues, Madam Clerk, why don't...",False,2013-01-15
1,16637,24,agenda0,"121007 Ordinance authorizing, pursuant to Char...",903.883,121007,918.88,CONSENT AGENDA,"[{'text': 'Madam Clerk, could you please call ...",False,2013-01-15
2,16637,25,agenda0,121064 Ordinance amending the San Francisco Bu...,918.88,121064,942.348,CONSENT AGENDA,"[{'text': ' Thank you.', 'speaker': 'SPEAKER_2...",False,2013-01-15
3,16637,26,agenda0,121139 Resolution approving and authorizing th...,942.348,121139,959.0,CONSENT AGENDA,[{'text': 'This has been kind of a longstandin...,False,2013-01-15
4,16637,41,agenda0,"120997 Ordinance appropriating $843,000 of Sta...",959.0,120997,959.0,CONSENT AGENDA,"[{'text': 'Last year alone, I'm sorry, since 2...",False,2013-01-15


The transcribed text of the city council meeting looks like this: a list of segments, each with timestamps and a speaker ID. The speaker IDs are randomly assigned, but they are used to track the same speaker across the entire course of the meeting.

In [4]:
city_council_data_df['transcribed_text'].iloc[1]

[{'text': 'Madam Clerk, could you please call item 12?',
  'speaker': 'SPEAKER_29',
  'start': 902.02,
  'end': 903.883},
 {'text': ' Item 12 is an ordinance appropriating $843,000 of state reserves and approximately $1.4 million from school districts set aside funds for the San Francisco Unified School District for fiscal year 2012 through 2013.',
  'speaker': 'SPEAKER_28',
  'start': 903.883,
  'end': 915.267},
 {'text': 'Supervisor Kim.',
  'speaker': 'SPEAKER_28',
  'start': 915.267,
  'end': 918.44},
 {'text': ' Thank you.',
  'speaker': 'SPEAKER_24',
  'start': 918.44,
  'end': 918.88},
 {'text': 'I realize that we are now finally coming to near end on discussion around the supplemental appropriation, and I just want to take a moment to thank my co-sponsors, Supervisors Campos, Marr, and Avalos, and also to Supervisors Cohen and Chu for your support for this supplemental.',
  'speaker': 'SPEAKER_24',
  'start': 918.88,
  'end': 937.206}]
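
Because each segment carries `start` and `end` timestamps, it is straightforward to measure how long each speaker talks during an agenda item. The following is a minimal sketch, not part of the original analysis: it copies a few segments from the output above (with one text shortened), and the names `segments_df` and `speaking_time` are illustrative.

```python
import pandas as pd

# A few segments copied from the output above; the second text is shortened.
segments = [
    {'text': 'Madam Clerk, could you please call item 12?',
     'speaker': 'SPEAKER_29', 'start': 902.02, 'end': 903.883},
    {'text': ' Item 12 is an ordinance ...',
     'speaker': 'SPEAKER_28', 'start': 903.883, 'end': 915.267},
    {'text': 'Supervisor Kim.',
     'speaker': 'SPEAKER_28', 'start': 915.267, 'end': 918.44},
    {'text': ' Thank you.',
     'speaker': 'SPEAKER_24', 'start': 918.44, 'end': 918.88},
]

# One row per segment, with its duration in the same units as the timestamps.
segments_df = pd.DataFrame(segments)
segments_df['duration'] = segments_df['end'] - segments_df['start']

# Total speaking time per speaker, longest first.
speaking_time = (
    segments_df.groupby('speaker')['duration']
    .sum()
    .sort_values(ascending=False)
)
print(speaking_time)
```

The same pattern could be applied to a full `transcribed_text` entry, for example to derive features such as the number of distinct speakers or the total discussion time per item.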

# Meetings and Articles

Now, let's look more closely at the data we produced in the last project, which pairs the news articles with the city council proposal text:

In [5]:
final_matching_df = pd.read_csv('../data/final-matching-articles-and-meetings.csv', index_col=0)

In [6]:
# this is a snippet of the news article that covered the city council minutes

print(final_matching_df['summary_text'].iloc[0])

On Tuesday, the San Francisco Board of Supervisors will for the second time this summer consider amendments to the Airbnb law regulating short-term rentals to fix the mess former Supervisor David Chiu's industry-sponsored bill created last year. The Board of Supervisors has an opportunity to find a compromise to show that a city can fully embrace home sharing and also restrict landlords from operating illegal hotels. Six months of reports and analyses from government agencies, the media, Airbnb and an academic study provide the data that the Board of Supervisors needs to support an ordinance that facilitates legal short-term renting while giving government agencies tools for proactively regulating away the bad actors. The Board of Supervisors needs to find a reasonable compromise on home sharing so that we can get back to confronting San Francisco's larger housing challenge.

In [7]:
# this is the proposal text

final_matching_df['meeting text'].iloc[0]

"Administrative Code - Short-Term Residential Rentals. Ordinance amending the Administrative Code to revise the Residential Unit Conversion Ordinance to: revise the definition of interested parties who may enforce the provisions of Chapter 41A, through a private right of action to include permanent residents residing within 100 feet of the residential unit; create an additional private right of action under certain circumstances; change the administrative hearing process from mandatory to at the request of any party found in violation of this Chapter; create an Office of Short-Term Residential Rental Administration and Enforcement staffed by the Planning Department, Department of Building Inspection, and Tax Collector's Office; and affirming the Planning Department's determination under the California Environmental Quality Act."

In [8]:
# this is how many city council minutes were covered (True) and not covered (False)

city_council_data_df['label'].value_counts()

label
False    15799
True      1301
Name: count, dtype: int64
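
These counts imply a heavily imbalanced label: only about 7.6% of items are covered. The sketch below rebuilds a label series from the counts shown above (rather than loading the real file) to show one way to compute that share. The variable names are illustrative.

```python
import pandas as pd

# Rebuild a boolean label series from the counts printed above.
labels = pd.Series([False] * 15799 + [True] * 1301, name='label')

# The mean of a boolean series is the share of True values.
positive_share = labels.mean()

# Number of uncovered items for every covered item.
negatives_per_positive = (~labels).sum() / labels.sum()

print(f"covered share: {positive_share:.1%}")
print(f"uncovered items per covered item: {negatives_per_positive:.1f}")
```

On the real data, `city_council_data_df['label'].value_counts(normalize=True)` gives the same proportions directly. An imbalance of this size is worth keeping in mind when choosing evaluation metrics, since plain accuracy would reward always predicting `False`.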

# Additional information (if needed)

Here is some more information about both the meetings and the news articles, if you need it.

In [41]:
# full article information / text
# the original final_matching_df doesn't have the full article text,
# so you might want to look at the actual text
import json

json_file = '../data/sfchron-fetched-articles.jsonl'
with open(json_file) as f:
    articles = [json.loads(line) for line in f]

sf_articles_df = pd.DataFrame(articles)

# reduce both URL forms ("com,sfchronicle)/..." and "http(s)://...") to a site-relative path;
# regex=False makes the '.' in 'www.' a literal dot on every pandas version
final_matching_df['key'] = (final_matching_df['article_url']
     .str.split(')')
     .str.get(-1)
     .str.replace('https://', 'http://', regex=False)
     .str.replace('www.', '', regex=False)
     .str.replace('http://sfchronicle.com', '', regex=False)
)

In [80]:
matching_df_with_full_text = (
    sf_articles_df
         .assign(key=lambda df: df['article_url'].str.split('sfchronicle.com').str.get(-1))
         [['key', 'article_text']]
         .merge(final_matching_df, on='key', how='right')
)

matching_df_with_full_text.head()

Unnamed: 0,key,article_text,city,committee,meeting date,File #,meeting text,summary_text,article_url,article_publish_date
0,/opinion/openforum/article/how-to-fix-san-fran...,,San Francisco,Board of Supervisors,2015-07-21,150363,Administrative Code - Short-Term Residential R...,"On Tuesday, the San Francisco Board of Supervi...","com,sfchronicle)/opinion/openforum/article/com...",2015-07-12 21:00:00+00:00
1,/bayarea/article/Keeping-S-F-light-rail-on-tra...,Photo: Siemens\n\nThe Municipal Transportation...,San Francisco,Board of Supervisors,2014-11-18,141197,Hearing - Update on the Municipal Transportati...,If the Board of Supervisors approves the contr...,http://sfchronicle.com/bayarea/article/Keeping...,2014-11-25 02:45:00+00:00
2,/bayarea/heatherknight/article/san-francisco-l...,,San Francisco,Board of Supervisors,2020-09-01,200884,Affirming the Statutory Exemption From Environ...,This would be a good time for the Board of Sup...,"com,sfchronicle)/bayarea/heatherknight/article...",2020-09-12 11:00:00+00:00
3,/bayarea/article/mayor-ups-proposed-housing-bo...,,San Francisco,Board of Supervisors,2015-05-19,150503,Committee of the Whole - Urgency Ordinance - Z...,"That's welcome news to Mission residents, some...","com,sfchronicle)/bayarea/article/com,sfchronic...",2015-06-09 01:31:04+00:00
4,/sf/article/breed-s-tenderloin-emergency-s-f-r...,,San Francisco,Board of Supervisors,2022-02-15,220155,Concurring in Actions to Meet Local Emergency ...,The city put out a news release touting progre...,"com,sfchronicle)/sf/article/com,sfchronicle)/s...",2022-02-08 17:35:27+00:00
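
Many rows above have an empty `article_text`, which means their `key` found no partner among the fetched articles. The sketch below, on made-up URLs, shows how the key normalization treats the two URL forms seen in `article_url`, and how `merge(..., indicator=True)` can report which rows matched. The URLs, the article text, and the helper name `normalize_article_key` are hypothetical.

```python
import pandas as pd

def normalize_article_key(urls: pd.Series) -> pd.Series:
    """Reduce an article URL to its site-relative path (hypothetical helper)."""
    return (
        urls.str.split(')').str.get(-1)
        .str.replace('https://', 'http://', regex=False)
        .str.replace('www.', '', regex=False)
        .str.replace('http://sfchronicle.com', '', regex=False)
    )

# Made-up URLs in the two forms that appear in the real data.
matches = pd.DataFrame({'article_url': [
    'com,sfchronicle)/bayarea/article/example-story-123.php',
    'https://www.sfchronicle.com/bayarea/article/other-story-456.php',
]})
matches['key'] = normalize_article_key(matches['article_url'])

# Made-up fetched article; only one of the two URLs above has full text.
fetched = pd.DataFrame({
    'article_url': ['https://www.sfchronicle.com/bayarea/article/other-story-456.php'],
    'article_text': ['Full text of the hypothetical article.'],
})
fetched['key'] = fetched['article_url'].str.split('sfchronicle.com').str.get(-1)

# indicator=True adds a `_merge` column: 'both' for matches, 'right_only' otherwise.
merged = fetched[['key', 'article_text']].merge(
    matches, on='key', how='right', indicator=True
)
print(merged[['key', '_merge']])
```

Counting `merged['_merge']` values on the real data would show how many matched articles actually have full text available.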


In [74]:
# full meeting information

full_meeting_data_with_headers = pd.read_json('../data/full-meeting-data-with-headers.jsonl', lines=True)
full_meeting_data_with_headers.head(2)

Unnamed: 0,clip_id,index,class_name,text,time,proposal_number,end_time,header,transcribed_text,speakers
0,16593,0,agenda0,1 ROLL CALL AND PLEDGE OF ALLEGIANCE,122.0,,123.0,ROLL CALL,[{'text': 'The first is a communication from t...,[SPEAKER_27]
1,16593,1,agenda0,2 AGENDA CHANGES,123.0,,123.0,AGENDA CHANGE,[{'text': 'The first is a communication from t...,[SPEAKER_27]


# Demo for Merged DF

In [81]:
renamed_article_matched_df = matching_df_with_full_text.rename(columns={
    'meeting text': 'policy text',
    'summary_text': 'article summary text',
    'article_text': 'article full text'
})

In [82]:
renamed_city_council_data_df = city_council_data_df.rename(columns={
    'text': 'policy text',
    'transcribed_text': 'meeting transcribed text'
})

In [90]:
full_merged_df = (
    renamed_article_matched_df[['File #', 'article full text', 'article summary text']]
         .merge(
             right=renamed_city_council_data_df[['proposal_number', 'policy text', 'meeting transcribed text', 'label']], 
             left_on='File #',
             right_on='proposal_number', 
             how='right'
         )
).drop(columns='File #')

In [92]:
full_merged_df.sample(5)

Unnamed: 0,article full text,article summary text,proposal_number,policy text,meeting transcribed text,label
12762,,,191286,191286 Public Trust Exchange Agreement - Calif...,"[{'text': 'Madam Clerk, next item.', 'speaker'...",False
14686,,,201395,201395 Official Naming of Unnamed Streets - Se...,"[{'text': 'Madam Clerk, please call item numbe...",False
9751,London Breed wins SF mayor’s race as Mark Leno...,Photo: Jessica Christian / The Chronicle Image...,180719,180719 Declaration of Election Results - June ...,"[{'text': 'Tang, aye.', 'speaker': 'SPEAKER_49...",True
7060,,,170100,"170100 Real Property Lease - SPOK, Inc. - Zuck...","[{'text': ' For both leases at Zuckerberg, San...",False
7876,,,170442,"170442 Public Works, Administrative Codes - Re...","[{'text': 'Madam Clerk, can we return to item ...",False
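
One property of this kind of right merge is worth checking: if a proposal is matched to more than one article, its policy row appears once per article in the result. Whether that happens in the real data is not shown here. The sketch below uses made-up proposal numbers and texts to illustrate the check.

```python
import pandas as pd

# Toy article matches: proposal 1001 is covered by two articles.
articles = pd.DataFrame({
    'File #': [1001, 1001, 1002],
    'article summary text': ['summary A', 'summary B', 'summary C'],
})

# Toy policies: proposal 1003 has no article.
policies = pd.DataFrame({
    'proposal_number': [1001, 1002, 1003],
    'policy text': ['policy 1', 'policy 2', 'policy 3'],
    'label': [True, True, False],
})

merged = articles.merge(
    policies, left_on='File #', right_on='proposal_number', how='right'
).drop(columns='File #')

# Proposals that occupy more than one row after the merge.
rows_per_proposal = merged.groupby('proposal_number').size()
print(rows_per_proposal[rows_per_proposal > 1])

# Keep one row per proposal if duplicates would distort label counts.
one_row_per_proposal = merged.drop_duplicates(subset='proposal_number')
```

If duplicates exist, label counts computed on the merged frame will differ from those on `city_council_data_df`, so it helps to decide up front whether the unit of analysis is the proposal or the proposal-article pair.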


In [None]:
# here's how you might combine the `policy text` and the `meeting transcribed text` columns:

In [105]:
full_merged_df_w_full_policy_text = (
    full_merged_df
     # pull the spoken text out of each transcript segment
     .assign(meeting_transcribed_text_col=lambda df:
             df['meeting transcribed text'].apply(lambda segments: [seg['text'] for seg in segments])
            )
     # prepend the policy text, then join the transcript segments with newlines
     .assign(full_policy_text=lambda df: 'policy text:\n\n' + df['policy text'] + '\n\n' + 'meeting text:\n\n' + df['meeting_transcribed_text_col'].str.join('\n'))
     .drop(columns=['meeting transcribed text', 'meeting_transcribed_text_col'])
)
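
The combination above drops the speaker IDs. If who is speaking matters for your task, one alternative is to prefix each line with its speaker. This is a minimal sketch, and `format_transcript` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def format_transcript(segments, include_speaker=True):
    """Join transcript segments into one string, one segment per line.

    Hypothetical helper: strips stray whitespace from each segment and,
    optionally, prefixes the line with the segment's speaker ID.
    """
    lines = []
    for seg in segments:
        text = seg['text'].strip()
        lines.append(f"{seg['speaker']}: {text}" if include_speaker else text)
    return '\n'.join(lines)

# Two segments copied from the transcript output shown earlier.
sample = [
    {'text': 'Madam Clerk, could you please call item 12?', 'speaker': 'SPEAKER_29'},
    {'text': ' Thank you.', 'speaker': 'SPEAKER_24'},
]

formatted = format_transcript(sample)
print(formatted)
```

It could be applied column-wise with `full_merged_df['meeting transcribed text'].apply(format_transcript)`.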

In [106]:
print(full_merged_df_w_full_policy_text['full_policy_text'].iloc[1])

policy text:

121007 Ordinance authorizing, pursuant to Charter Section 9.118(a), a System Impact Mitigation Agreement with North Star Solar, LLC, requiring North Star Solar, LLC, to pay the Public Utilities Commission the costs necessary to mitigate the impacts to the City’s electric system caused by the interconnection of North Star Solar, LLC’s solar project to the electric grid; authorizing similar mitigation agreements with other projects in the future; appropriating funds from these agreements to pay the costs of mitigation work; and placing various mitigation funds on reserve with the Board of Supervisors.

meeting text:

Madam Clerk, could you please call item 12?
 Item 12 is an ordinance appropriating $843,000 of state reserves and approximately $1.4 million from school districts set aside funds for the San Francisco Unified School District for fiscal year 2012 through 2013.
Supervisor Kim.
 Thank you.
I realize that we are now finally coming to near end on discussion around t