# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments To Date for Group 1

Please find below three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm glad to see progress on the project and it seems like you are working together, which is great. I also think your choice of topic remains a rich one and I'm excited to see how you're working together to realize your proposed project. However, I have some questions about the division of labor, as well as the overall scope and documentation of your project. As a reminder, your goal is to try to create something similar to the datasets we have been exploring on the *Responsible Datasets in Context Project*. While I do not expect what you produce to be as polished as what is available on the website, considering there are four of you in the group, I am hoping you can clarify some of your current choices and provide more detail on how you are envisioning your final project and how you will achieve the stated project goals and requirements.

In terms of returning feedback to the Instructor, you have two options: create a new GitHub issue responding to my questions and tag me in the issue, or complete the Google form, available via Canvas, where you can update your project plan and respond to my questions or offer suggested corrections to the stated assessments. You are also welcome to use the form to provide feedback that will not be shared with the rest of your group if you think that would be helpful or you want to share something with me privately.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [1]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_1.csv")
commits_df = pd.read_csv("./datasets/commits_group_1.csv")
issues_df = pd.read_csv("./datasets/issues_group_1.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, though I notice some patterns below that I have questions about. As noted in previous feedback, your GitHub repository still needs a `.gitignore` file, and I think it would be helpful to clean up the repository a bit, especially your main `README.md`, which currently does not really explain the structure of the repository. I did appreciate that you have a license, that you updated your repository description, and that you list the Project Manager in the `README.md`.

I will say that while I respect and understand your current approach to dividing labor, it does make it somewhat difficult for me to assess contributions, since it seems like some group members have primarily been doing the coding and issue management (at least based on the data you will see below), whereas others are doing more of the research and data entry intensive labor. If you could please briefly confirm on the Google form on Canvas that the division of labor is relatively equitable, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all of your group's work or activities, so I would encourage you to both think about how to document work in more detail and also to contact the Instructors if you feel this data is not representative of your group's work and would like other aspects to be included in your assessment.

### Contributions Per Group Member

In [2]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there is a wide range of commits per group member, which is not necessarily a problem. But currently my assessment is that A.G. is putting in most of the coding work on the project. Is this accurate or not? If so, is there a reason for this? I can't currently tell from the division of labor you've detailed why A.G. would be doing more of this labor, so please feel free to share more details. Also, if there is not a rationale, is it possible to distribute the coding work more equitably? I trust you all to figure out the best way to do this, but I need either confirmation of the current division of labor or a revised plan to ensure that all group members are contributing equally to the project.
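One way to put numbers on this question is to convert the commit counts into shares of the total. The following is only a minimal sketch, not part of the assessment: it assumes the `login` and `contributions` columns used in the table above, the `contribution_shares` helper is a hypothetical name, and the rows in `example_df` are placeholders rather than your group's data.

```python
import pandas as pd


def contribution_shares(contributors: pd.DataFrame) -> pd.DataFrame:
    """Return each contributor's share of all commits as a percentage."""
    shares = contributors[["login", "contributions"]].copy()
    total = shares["contributions"].sum()
    shares["share_pct"] = (shares["contributions"] / total * 100).round(1)
    # Largest share first, with a clean 0..n-1 index
    return shares.sort_values("share_pct", ascending=False).reset_index(drop=True)


# Placeholder rows for illustration only; these are NOT your group's numbers.
example_df = pd.DataFrame(
    {
        "login": ["member_a", "member_b", "member_c", "member_d"],
        "contributions": [10, 5, 3, 2],
    }
)
example_shares = contribution_shares(example_df)
print(example_shares)
```

If you try this, you could call `contribution_shares(contributors_df)` on the dataframe loaded at the top of this notebook. Keep in mind that commit counts are only one signal and do not capture research or data entry done outside of GitHub.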

### Commits Over Time

In [3]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time"
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (that is, additions, deletions, and number of files changed), it does look like you have been working on the group GitHub repository somewhat consistently, which is good. But again, it looks like some members are doing most of the work. While commits do not equate to quality of work, seeing this pattern makes me wonder about the division of labor in your group and whether there are areas to document more clearly or work to redistribute.

### Issues and Project Management Over Time

In [4]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename columns because Altair does not let us use '.' in the column names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = subset_data['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = subset_data['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message
    if not subtitle:
        subtitle = ["Issue is not associated with a pull request or a project board."]
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it looks like Divya has been a bit more involved here, whereas Thea has not been as engaged. Is that accurate or am I missing something? Again, the roles you describe do not easily map to these activity patterns, so feel free to briefly share some details about how you've ended up with this distribution.

Overall, it seems like you're using the issues and project board to help you with your project, but I do notice that many of the issues are being closed immediately after being opened so there might be room to leverage issues more as you complete the project. I realize using a new platform and interface for project management can be difficult, but I would encourage you to think about how you can use these tools to not only help you manage your project and communicate with each other, but also document your work for the final project.
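If you want to check the pattern of quickly closed issues yourselves, the sketch below measures how long each closed issue stayed open. It is a rough illustration under stated assumptions: it uses the `title`, `user.login`, `created_at`, and `closed_at` columns from `issues_df`, assumes the timestamps are ISO 8601 strings, and uses a five-minute threshold that is an arbitrary choice of mine. The `time_to_close` helper and the example rows are hypothetical placeholders.

```python
import pandas as pd


def time_to_close(issues: pd.DataFrame) -> pd.DataFrame:
    """Return closed issues with the number of minutes each one stayed open."""
    # Issues without a closed_at value are still open, so leave them out
    closed = issues.dropna(subset=["closed_at"]).copy()
    created_at = pd.to_datetime(closed["created_at"], utc=True, errors="coerce")
    closed_at = pd.to_datetime(closed["closed_at"], utc=True, errors="coerce")
    closed["minutes_open"] = (closed_at - created_at).dt.total_seconds() / 60
    return closed[["title", "user.login", "minutes_open"]].sort_values("minutes_open")


# Placeholder rows for illustration only; these are NOT your group's issues.
example_issues = pd.DataFrame(
    {
        "title": ["Example issue A", "Example issue B", "Example issue C"],
        "user.login": ["member_a", "member_b", "member_a"],
        "created_at": ["2024-10-01T15:00:00Z", "2024-10-02T09:00:00Z", "2024-10-03T12:00:00Z"],
        "closed_at": ["2024-10-01T15:02:00Z", "2024-10-04T09:00:00Z", None],
    }
)
example_durations = time_to_close(example_issues)
# Assumed threshold: treat anything closed within five minutes as "immediately closed"
quick_closes = example_durations[example_durations["minutes_open"] < 5]
print(quick_closes)
```

An issue that is opened and closed within minutes is usually recording work after the fact rather than planning it, which is the distinction I am asking you to think about.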

## Group Assignments To Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- I believe this assignment is your `brainstorm-doc.md`, but please correct me if I am missing something. Overall on this assignment, I thought you did a great job, though I was a bit confused about why you didn’t include some of this research in your documentation for the Mid-Semester Dataset Update. Do you think this context for how you developed the project is not helpful, or were you just not sure how to include it? I personally think it would be helpful to include some of this information in your documentation, so I would encourage you to include it in the final project submission, and I am happy to answer questions if you are not sure how to do so.
- I really appreciated your thoughtful comparison between Flash games and ancient hieroglyphs, with emulators as the modern-day Rosetta Stone. This analogy was not only creative but also demonstrated a deep understanding of how digital objects can have historical counterparts, drawing meaningful connections across time. Your exploration of both older and newer preservation methods, such as the Internet Archive’s use of Ruffle and the detailed work of projects like Flashpoint, was excellent. This shows a strong grasp of how digital preservation evolves over time and the importance of both technological and community-led efforts. Again, great work, and hopefully you can include some of this in your final project documentation!
- Currently I only see A.G. working on this document in the git history, so they will get full marks but I am happy to share the grade between all members once it is confirmed they have all contributed.

Status: Complete & Full Marks for A.G. (Though happy to update this if you can share more details about how you worked together on this assignment)

### Critical Cultural Data Explorations Assignment

- I believe that this assignment is the `cultureasdata-part1.md` and `Part-2-Perspective-Power.txt` files, but again please correct me if I am wrong.
- For `cultureasdata-part1.md` it looks like A.G. and Thea worked on this together, though I’ll be honest that the git history is a bit wonky for this document but this section looks great! In particular, I really appreciated your attention to how dress-up games in the dataset often center white, Euro-centric beauty standards and promote narratives geared towards girls. This insight reflects an excellent awareness of how digital artifacts can convey cultural biases and power structures. Your discussion of these observations shows a nuanced understanding of the implications for representation in digital archives. Also your exploration of the gaps in the dataset, such as the potential survivorship bias and the challenges with tracing the country of origin, was impressive. Highlighting these limitations demonstrates your ability to think critically about what is missing and how it might affect the analysis. It’s clear that you understand how the nature of volunteer-driven archives like Flashpoint can lead to a collection that, while extensive, may lack certain contextual metadata.
- For `Part-2-Perspective-Power.txt` it seems like Ellis and Divya shared the work, and this section also looks great! I really appreciated your attention to how the AI-generated descriptions highlighted technical and preservation aspects but missed deeper engagement with cultural and social dynamics. Your critical analysis of how dress-up games reflect gendered experiences and the cultural context of their creation is insightful, showing an understanding of the complex narratives behind these digital objects. Your exploration of the gaps and biases in archival practices, such as the potential exclusion of games from marginalized creators or those that challenge traditional norms, was particularly strong. This level of reflection demonstrates an excellent grasp of the limitations inherent in digital archives and the importance of questioning the neutrality of data preservation.
- Well done on this assignment overall, and again I would encourage you to include some of this information in your documentation for the final project submission. Given the shared labor and the quality of the submission, every group member will receive the same mark for this assignment.

Status: Complete & Full Marks for All Group Members

### Proprietary & Perspectival Dataset Creation Assignment

- Determining which file represented this assignment was particularly confusing since you have multiple folders and files with the assignment name. It seems like the assignment is in your `proprietary-perspectival-dataset-creation` folder in the `README.md` file, but please correct me if I am wrong.
- Based on the git history, it looks like A.G. and Thea did the majority of work, with Ellis committing once with an extra line. Is that accurate? If not, please let me know so I can update my assessment.
- Overall, I thought your choice of Instagram as a platform and of `@illinois1867` as the account was great. Your attention to the legal and ethical considerations of collecting data from a public Instagram account was commendable. Acknowledging the potential issues around informed consent and choosing metrics that are publicly visible demonstrated a strong understanding of the ethical implications of digital research. This shows that your group is not only thinking critically about the data itself but also about the broader context in which it is collected and analyzed. I like the idea of trying to capture seasons, but I’m not sure I’m sold on that really being that subjective since you have access to the date of the Instagram post. Also, I'm wondering why you didn’t include the post date, since that would help make the case for why seasons are subjective on Instagram.
- I'm giving this assignment partial marks because you did capture some data and think critically about the limitations of the platform, but I’m not sold that labeling an Instagram post as Spring is that subjective, and you shared screenshots of the dataset rather than the dataset itself. If you can provide more information about how seasons are subjective in this context and share the dataset, I would be happy to give you full marks.

Status: Incomplete & Partial Marks for A.G., Thea, and Ellis (Though happy to update this if you can share more details about how you worked together on this assignment)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

Overall, the dataset is starting to look promising. I appreciate that you are clearly working together based on the Google Sheets version history. Also, it seems like you are already starting to see some interesting patterns in terms of which genders are represented and how many skin tones. I was curious why some of the rows for `operability` remain blank, since you have `operable`, `non-operable`, and `partially operable` as options. I also appreciated that you included a link to the actual game you were playing, which is helpful for someone trying to replicate your results.
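A quick check like the following could help you catch blank or mistyped `operability` values before you share the dataset. This is a minimal sketch under assumptions: the `operability` column and its three labels come from your dataset, while the `game_title` column, the `flag_operability_problems` helper, and the example rows are hypothetical placeholders.

```python
import pandas as pd

# The three labels currently used in the dataset
ALLOWED_OPERABILITY = {"operable", "non-operable", "partially operable"}


def flag_operability_problems(games: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose operability is blank or not one of the allowed labels."""
    # Normalize whitespace and capitalization so "Operable " still counts as "operable"
    values = games["operability"].fillna("").astype(str).str.strip().str.lower()
    is_blank = values == ""
    is_unexpected = ~is_blank & ~values.isin(ALLOWED_OPERABILITY)
    flagged = games[is_blank | is_unexpected].copy()
    flagged["problem"] = [
        "blank" if blank else "unexpected value" for blank in is_blank.loc[flagged.index]
    ]
    return flagged


# Placeholder rows for illustration only; "game_title" is an assumed column name.
example_games = pd.DataFrame(
    {
        "game_title": ["Placeholder 1", "Placeholder 2", "Placeholder 3", "Placeholder 4"],
        "operability": ["operable", None, "Partially Operable", "broken"],
    }
)
flagged_rows = flag_operability_problems(example_games)
print(flagged_rows)
```

If a blank is meaningful (for example, the game has not been tested yet), it would be clearer to add an explicit label for that case than to leave the cell empty.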

However, I do have some concerns about your current approach. First, I thought it was a bit odd and perhaps unnecessary to hand enter some of the fields that already exist in the Flashpoint Archive. From looking at the wiki, the Flashpoint Archive offers a number of ways to access its full dataset, so I'm confused about why you would not use that data. Second, I noticed that you did not include a link to the game in the Flashpoint Archive or the tags you used to find it, which seems like a missed opportunity to document your process and to help others replicate your work. Relatedly, you might consider recording which group member entered which data, since some of your assessments seem quite subjective, specifically the gender field. For the gender field, I'm a bit concerned and confused by the choice to represent gender as a binary in the dataset. We have discussed the tradeoffs of this approach at length in class, so I'm curious why you chose this more reductive representation. Finally, I'm curious about operability and how long it took you to assess it, and whether you could capture the time spent as a field in the dataset, both to make your labor more transparent and to make clear how thoroughly you assessed operability. For example, did you play the game for 5 minutes or 5 hours? How thorough are you being in assessing customization options, etc.? You do not have to be comprehensive in your assessments, but if you present your data as comprehensive when it is not, that can cause problems.

For your final project, I would highly recommend including a more detailed example of your process for playing and assessing a game, to help elucidate how you are making these decisions and to help others replicate your work. I would also **highly encourage** you to keep a version of your dataset in your GitHub repository, even as you continue adding data in Google Sheets, to make your versioning more transparent and to help you document your work.
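As one possible way to keep dated copies of the dataset in the repository, here is a minimal sketch. It assumes you first download the sheet as a CSV (in Google Sheets, File > Download > Comma Separated Values) and load it with `pd.read_csv`. The `save_snapshot` helper, the folder name, and the file naming scheme are all hypothetical choices, and the example writes placeholder data to a temporary folder so that nothing is left behind.

```python
import tempfile
from datetime import date
from pathlib import Path

import pandas as pd


def save_snapshot(dataset: pd.DataFrame, folder: str, snapshot_date: date) -> Path:
    """Write the dataset to a dated CSV file inside folder and return its path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The ISO date in the file name keeps snapshots sorted chronologically
    out_path = out_dir / f"dataset_{snapshot_date.isoformat()}.csv"
    dataset.to_csv(out_path, index=False)
    return out_path


# Placeholder data for illustration only; the column names are assumptions.
placeholder = pd.DataFrame({"game_title": ["Placeholder 1"], "operability": ["operable"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = save_snapshot(placeholder, tmp_dir, date(2024, 11, 1))
    example_name = example_path.name
    round_trip = pd.read_csv(example_path)
print(example_name)
```

In your repository you might call something like `save_snapshot(dataset_df, "datasets/snapshots", date.today())` and commit the resulting file, so that the git history shows how the dataset grew over time.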

### Documentation Feedback

I appreciated that you created both a [`README.md`](is310-fall-2024-group-1/midpoint-data-docs/README.md) and a [`dataset-documentation.md`](is310-fall-2024-group-1/midpoint-data-docs/dataset-documentation.md) file, and think both are a good idea for the project, but I have some questions and concerns about the documentation.

It looks like Ellis did the entire `README.md` for the folder, whereas A.G., Thea, and Ellis contributed to the `dataset-documentation.md`. You write in that latter file that:

> "Our documentation is a collective effort between all members of the group, and is updated and revised across various points of the project lifecycle."

But I'm wondering if that is accurate considering the division of labor I see in the git history. Could you explain that discrepancy? Are you working on the documentation together where one person is writing and the others are reviewing? Or are you dividing up the work in some way?

I'm pressing you on this point because the documentation is a bit confusing. For example, the main folder `README.md` seems to somewhat duplicate what you have in the `dataset-documentation.md` file and some of the formatting of the `dataset-documentation.md` is difficult to read. So it would likely be helpful to spend more time on this documentation to make sure it is accurate and useful, and perhaps that is an area where the group can work together to improve if you are not already doing so.

I did appreciate the inclusion of a rationale for how you are collecting which fields from the Flashpoint Archive, though again I do think you could do some of this programmatically with the available data from the Flashpoint Archive (happy to assist if you are having difficulties!). I also liked that you included details about the columns in the datasets, though I do think operability, gender, and number of skin tones could use more detail in terms of how you are evaluating these criteria and what they represent.

I thought your `Collecting Data` section was particularly strong, though I'm concerned that when I tried to duplicate your searches for the tags you listed, I was only able to find results for `Dress Up`. This raises some serious concerns over whether you actually searched for those tags and whether you documented your process accurately. I would encourage you to be more transparent about your process and to make sure that you are actually following the process you detail in your documentation. For example, you say you tested these keywords iteratively, but where is the evidence of that in your dataset or documentation? You could include a list of keywords you decided to exclude and why, to help make that process more visible. Remember, the goal is not just creating a dataset but doing so responsibly, which means considering how others might be able to replicate or understand your processes!

Overall, though, I'm still concerned that even if you document this process more thoroughly, what you've currently outlined seems to be just randomly selecting games based on how they are returned through the Flashpoint Archive database. For instance, you have collected 20 games in the most recent version of your dataset that seem to cover each year, but do you know how many games in the archive would actually meet your criteria and how those are distributed over your time period? When I was doing some basic searching, I found that the games tagged `Dress Up` were not evenly distributed over time, so I'm curious how you are going to address this in your final dataset. I would be happy to help consult on getting the full dataset from the Flashpoint Archive to help you make this determination, and hopefully that context would help you make more informed decisions about which games to include in your dataset.
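To illustrate what I mean by checking the distribution, the sketch below counts games with a given tag per year and then spreads a sample size across years in proportion to those counts. It is only a sketch under assumptions: the `title`, `tags`, and `release_year` column names, the `;` tag separator, and both helper functions are hypothetical and may not match the actual Flashpoint Archive metadata, and the example rows are placeholders.

```python
import pandas as pd


def games_per_year(metadata: pd.DataFrame, tag: str) -> pd.Series:
    """Count games carrying the given tag for each release year."""
    # Assumed format: tags stored as one string separated by semicolons
    has_tag = metadata["tags"].fillna("").str.split(";").apply(
        lambda tags: tag.lower() in [t.strip().lower() for t in tags]
    )
    return metadata.loc[has_tag, "release_year"].value_counts().sort_index()


def sample_targets(counts: pd.Series, sample_size: int) -> pd.Series:
    """Spread sample_size across years in proportion to each year's count."""
    # Rounding means the targets may not always add up exactly to sample_size
    return (counts / counts.sum() * sample_size).round().astype(int)


# Placeholder rows for illustration only; NOT real Flashpoint Archive data.
example_metadata = pd.DataFrame(
    {
        "title": [f"Placeholder game {i}" for i in range(12)],
        "tags": ["Dress Up; Fashion"] * 2
        + ["Dress Up"] * 6
        + ["Dress Up; Puzzle"] * 2
        + ["Puzzle"] * 2,
        "release_year": [2005, 2005, 2006, 2006, 2006, 2006, 2006, 2006, 2007, 2007, 2006, 2007],
    }
)
example_counts = games_per_year(example_metadata, "Dress Up")
example_targets = sample_targets(example_counts, sample_size=5)
print(example_counts)
print(example_targets)
```

Proportional sampling is only one option. You might instead decide to sample the same number of games from every year, but either way the choice should be stated and justified in your documentation.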

Finally, I thought your inclusion of the technologies you are using and likely future additions was helpful, but again I had more questions. You mention screen recording tools, but what are you screen recording and how will that become part of your dataset? You mention using IDEs and tools for data visualization, but at this point it seems like you have a very small dataset and no code, so I'm curious what you're actually using or planning to use these tools for. Again, while having documentation is crucial, it's also important to think about what is useful and relevant to both your project and future users of your dataset.


### Progress from Initial Proposal Feedback

In the initial proposal, I had asked you to address the following questions in your project (paraphrased):

- [ ] What specific patterns will you analyze in the games (e.g., visual trends, gameplay mechanics)? How will you structure a coding framework to assess gender portrayals? What is your plan for augmenting the dataset with new layers, like coding aesthetic or thematic patterns?
- [ ] For games that are no longer operational, how will you handle these? Will you include information about their degree of operability?
- [ ] How will you evaluate the reliability and comprehensiveness of the archived metadata? Could you use the JSON metadata available for each game instead of web scraping? How might this streamline your data collection?
- [ ] Have you explored the Flashpoint Archive’s wiki and guides for metadata consistency, search functionality, and any controlled vocabularies? How might these resources assist your project?
- [ ] How will you adjust the division of labor now that one group member has dropped the course? How will you assign roles and responsibilities to leverage each member’s expertise? Are there specific tasks each member will focus on?
- [ ] Will you start with a subset of data to test your workflow? If so, how will you determine when you have enough data to meet your project goals? Does your timeline cover all project milestones, and how does it align with the final project requirements?

My current assessment is that some of these concerns are no longer as relevant (for example, you now seem less interested in gameplay mechanics), but many still are, and while you have addressed some of them, many remain for you to address. In particular, I think you need to do some more work on explicating and refining how you are subjectively assessing gender, skin tone, and operability (though a detailed example will likely help on that front). I also think you need to do more work on explaining and documenting your selection process for games, specifically to make sure that you are not just randomly selecting games but are actually following a process that can be replicated and has a compelling rationale.

I do think you could work a bit more programmatically with the data available from the Flashpoint Archive, but I will leave that to you to decide (though you will have to explain your rationale!). I also think you need to do a bit more work on refining your documentation and on documenting your group's collaboration more accurately, though again I am happy to respect your listed responsibilities as long as all group members confirm that they represent an equitable division of labor.

I do think you are on the right track with your project, but I think addressing some of the concerns I have outlined above will help make sure you are on target to meet the project goals and requirements. I also think bringing in some of our course readings and your work on previous group assignments will help flesh out and situate your project. I would encourage you to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently a B+. I think you are on the right track with your project, but I think you need to do some more work. If you can provide a bit more detail on your process for selecting games (specifically, why the process you document doesn't seem to match the tags I'm seeing), I would be happy to bump you up to an A- for this assignment. And I think if you can address some of the concerns I have outlined above, you will be in a good position to complete the project successfully. Also, I'm currently planning to share this grade between all group members, but please let me know if that is not accurate or appropriate.