## Scrape text messages from Teams chat (within a channel)

This notebook uses the Google Chrome browser, Selenium, and chromedriver to scrape the text
messages (with timestamp and author) from a chat between members within a Teams channel. Entering credentials and completing two-factor authentication is a manual step, but once you have navigated to the correct chat, the notebook automatically scrolls back through the history, extracts
the timestamp, author, and message of each entry, and saves these to three separate lists. Finally, there are cells to load these lists into a dataframe and save it in Excel format.

This notebook *does not* extract or save images shared within the chat.

### Requirements:
- pip install selenium
- pip install pandas openpyxl (needed for the dataframe and Excel cells at the end)
- Chrome browser installed
- chromedriver downloaded

### Resources: 

This notebook has been adapted from the following resources which describe the process to extract 'Posts' text from a Teams channel (which uses different selectors to the method in this notebook).

- https://blog.stackademic.com/unlocking-data-scraping-teams-channel-using-selenium-in-python-369a390b07e9
- https://github.com/alenagorb/teams_channel_scrape_selenium/blob/main/General_Teams_Channel_Scraper.ipynb

In [None]:
#Imports

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains

import time
from datetime import datetime
from zoneinfo import ZoneInfo

#Load the instance of Chrome Driver from local disk drive
opts = webdriver.ChromeOptions()

# UPDATE this path to the location where your chromedriver is saved
serv = Service(r"C:\chromedriver-win64\chromedriver-win64\chromedriver.exe")
driver = webdriver.Chrome(service=serv, options=opts)
driver.maximize_window() # Maximize the browser window
time.sleep(5)

#Open Teams webpage in Chrome
driver.get('https://teams.microsoft.com/v2/') #Teams URL
time.sleep(3)


In [None]:

''' 
This cell is optional but speeds up the Chrome load process if run with the previous cell. 
It populates the username field and submits which redirects to organisation sign-in page 
   where you will need to manually enter your credentials and complete the 2FA
'''

#Target username field
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='loginfmt']")))

#Enter username
username.clear() # Clear existing characters
username.send_keys("youremail@yourorganisation.com") # UPDATE to your username for teams

# Target the next button and click it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='submit']"))).click()
time.sleep(3)


### Manual steps:
- Enter credentials and complete 2FA
- Change to the appropriate Teams channel (top right of the window)
  

### Select the correct chat in the left menu


In [None]:

# Optional cell - Click on the target chat (this id is hardcoded for a particular user; UPDATE it to suit)

# Wait for the chat list item first, then click it (click() returns None, so keep the element separately)
target_chat = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="title-chat-list-item_19:62c95a03-91f0-49db-bb16-fa0a0a16b5b7_c3302ba1-a3d6-46e4-9641-1a0b379934e0@unq.gbl.spaces"]' )))
target_chat.click()


In [None]:

# Select the chat pane

xpath = '//*[@id="chat-pane-list"]'
chatArea = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, xpath )))
chatArea.click()


### Main scraping code

1. Ensure the chat to be scraped is selected (cell above)
2. Scroll to the bottom of the chat (most recent message)
3. Run the cell below and do not interact with (scroll or click in) the Chrome window that contains the Teams chat. It is OK to open other windows on top.
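
The merge-and-stop logic of the cell below can be tried without a browser. The following is a minimal sketch, not the notebook's exact code: `merge_batch`, `collect`, `max_idle`, and the demo data are hypothetical names and values chosen for illustration. It assumes each scroll yields a batch of (timestamp, author, message) tuples that overlaps with earlier batches, and it stops after a run of batches that add nothing new.

```python
def merge_batch(seen, rows, batch):
    """Append rows whose timestamp has not been seen yet; return how many were added."""
    added = 0
    for timestamp, author, message in batch:
        if timestamp not in seen:
            seen.add(timestamp)
            rows.append((timestamp, author, message))
            added += 1
    return added


def collect(batches, max_idle=3):
    """Merge batches until max_idle consecutive batches contribute no new rows."""
    seen, rows, idle = set(), [], 0
    for batch in batches:
        if merge_batch(seen, rows, batch):
            idle = 0      # progress was made, so reset the idle counter
        else:
            idle += 1     # nothing new was loaded by this scroll
        if idle >= max_idle:
            break         # treat this as the start of the chat history
    return rows


# Hypothetical batches: each scroll reveals older messages plus some already seen.
demo_batches = [
    [("10:02", "Ana", "c"), ("10:03", "Ben", "d")],
    [("10:01", "Ana", "b"), ("10:02", "Ana", "c")],
    [("10:00", "Ben", "a"), ("10:01", "Ana", "b")],
    [("10:00", "Ben", "a")],
    [("10:00", "Ben", "a")],
    [("10:00", "Ben", "a")],
    [("09:59", "Ana", "never reached")],
]

rows = collect(demo_batches)
print(len(rows))  # 4
```

Keeping each message as one tuple, rather than three parallel lists indexed by position, is a design choice of this sketch: it means the timestamp, author, and message of a row cannot drift out of step.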

In [None]:

message_list = []
author_list = []
timestamp_list = []

'''

While the loop runs, the timestamp of each newly found message is printed as it is added to the lists.
Once there are no more messages to add, the progress line looks like << SCROLL count: 285, len: 5420, 5420, 5420 >>.
Here, 'count' keeps increasing, but the list lengths stay the same.
This pattern repeats 3 times before the break is triggered and the message "End of message history" is displayed.

'''

no_new_messages = 0
scroll_count = 0

while True:
    
    # Look for all the elements that match the selectors for message, author, and timestamp
    new_message_elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-tid="chat-pane-message"]')))
    new_author_elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-tid="message-author-name"]')))
    new_timestamp_elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'time[datetime]')))

    # Get the text strings from these elements
    new_message_strings = [message.text for message in new_message_elements]
    new_author_strings = [author.text for author in new_author_elements]
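    # Convert each UTC timestamp to a local-time string
    # (the loop variable is named ts so it does not look like the time module)
    new_timestamp_strings = [
        datetime.fromisoformat(ts.get_attribute('datetime').replace('Z', '+00:00'))
        .astimezone(ZoneInfo('Australia/Adelaide')) # UPDATE with your timezone
        .strftime('%Y-%m-%d %H:%M:%S %Z')
        for ts in new_timestamp_elements
    ]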

    # Add any newly loaded messages to the saved lists.

    # A message counts as new if its timestamp is not already in timestamp_list; its timestamp,
    # author, and message are then appended to the three lists. Note that messages sharing an
    # identical timestamp string are treated as duplicates and skipped.
    for j, item in enumerate(new_timestamp_strings):

        if item not in timestamp_list:

            no_new_messages = 0
            print(item)

            timestamp_list.append(new_timestamp_strings[j])
            author_list.append(new_author_strings[j])
            message_list.append(new_message_strings[j])

    # This scrolls up by simulating the pressing of the HOME key.
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.HOME)
    
    # Optional output to monitor progress and list lengths (unequal lengths will cause a mismatch error when the dataframe is built later)
    print(f'SCROLL count: {scroll_count}, len: {len(timestamp_list)}, {len(author_list)}, {len(message_list)}')
   
    scroll_count += 1  
    no_new_messages += 1
    print("NNM: ", no_new_messages)
    
    time.sleep(3)

    # Attempt to scroll up 3 times before calling break
    if no_new_messages == 3:
        print("End of message history")
        break
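

The timestamp handling inside the loop above can be checked in isolation. This is a small sketch under stated assumptions: `to_local_string` is a hypothetical helper name, and the sample inputs are made-up values in the ISO 8601 UTC form (ending in `Z`) that the loop's `.replace('Z', '+00:00')` call implies. It also assumes time zone data is available to `zoneinfo` on your system.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def to_local_string(iso_utc, tz_name="Australia/Adelaide"):
    """Convert an ISO 8601 UTC timestamp string to a formatted local-time string."""
    # Replacing 'Z' keeps this working on Python versions before 3.11,
    # where fromisoformat() does not accept the 'Z' suffix.
    utc_time = datetime.fromisoformat(iso_utc.replace("Z", "+00:00"))
    local_time = utc_time.astimezone(ZoneInfo(tz_name))
    return local_time.strftime("%Y-%m-%d %H:%M:%S %Z")


print(to_local_string("2024-01-15T02:30:00.000Z"))  # 2024-01-15 13:00:00 ACDT
print(to_local_string("2024-07-15T02:30:00.000Z"))  # 2024-07-15 12:00:00 ACST
```

Because the output is formatted at one-second resolution, two messages sent within the same second produce the same string, which matters when timestamps are used to detect duplicates.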
        
    

In [None]:

# Optional cell - Check final list lengths are all the same
for item in [message_list, author_list, timestamp_list]:
    print(len(item))
    

### Create Dataframe from lists
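
`pd.DataFrame` raises a `ValueError` when the column lists passed in a dict differ in length, so it can help to check the lengths explicitly first. The following is a minimal sketch; `check_equal_lengths` is a hypothetical helper and the sample values are placeholders, not data from a real chat.

```python
def check_equal_lengths(**columns):
    """Return the length of each named list, raising if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Placeholder lists standing in for timestamp_list, author_list, and message_list.
lengths = check_equal_lengths(
    Time=["t1", "t2"],
    Author=["a1", "a2"],
    Message=["m1", "m2"],
)
print(lengths)  # {'Time': 2, 'Author': 2, 'Message': 2}
```

If the check fails, the reported lengths show which list is short, which is usually easier to act on than the error raised during dataframe construction.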

In [None]:

import pandas as pd
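# Build the dataframe from the three lists filled by the scraping cell
df = pd.DataFrame({'Time': timestamp_list, 'Author': author_list, 'Message': message_list})
# Drop rows with a repeated timestamp, then sort from oldest to newest
df = df.drop_duplicates(subset=['Time']).sort_values(by='Time', ascending=True).reset_index(drop=True)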
df


### Save Dataframe to Excel
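
The dated output filename can also be assembled with `pathlib`, which avoids handling trailing backslashes by hand. This is a sketch only: `archive_path` is a hypothetical helper, and the folder and filename are the same placeholders used in the cell below.

```python
from datetime import datetime
from pathlib import Path


def archive_path(folder, filename, when):
    """Build '<folder>/<filename>_<YYYYMMDD>.xlsx' as a Path object."""
    return Path(folder) / f"{filename}_{when:%Y%m%d}.xlsx"


# Placeholder folder and a fixed date, chosen so the result is reproducible.
target = archive_path("your_filepath", "Chat_Archive_Filename", datetime(2024, 1, 15))
print(target.name)  # Chat_Archive_Filename_20240115.xlsx
```

The resulting `Path` can be passed directly to `df.to_excel(...)`; in practice `datetime.now()` would replace the fixed date.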

In [None]:

from datetime import datetime
now = datetime.now()

# UPDATE path and filename to suit
path = r'C:\your_filepath\\'
filename = 'Chat_Archive_Filename'

date_time = now.strftime("%Y%m%d")

# Saving with to_excel in .xlsx format preserves the emojis in the message text (requires the openpyxl package)
df.to_excel(f'{path}{filename}_{date_time}.xlsx', index=False, header=True) 
